Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Harnessing PDF Data for Improving Japanese Large Multimodal Models

Created by
  • Haebom

Authors

Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa

Outline

In this paper, we propose leveraging underutilized Japanese PDF data to address the limited performance of Japanese large multimodal models (LMMs), which stems from the scarcity of high-quality Japanese training data. Instead of relying on existing datasets translated from English, we build a fully automated pipeline that extracts image-text pairs from PDFs using pre-trained models and then generates instruction-tuning data from the extracted pairs. Training Japanese LMMs on this data yields improvements of 2.1% to 13.8% on the Japanese LMM benchmark Heron-Bench. We further verify the utility of PDF data through performance analyses across model sizes and underlying language models.
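The pairing stage of such a pipeline can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: it assumes image and text blocks have already been detected on a PDF page (e.g. by a layout-analysis model) and pairs each image with the nearest text block below it, dropping candidates whose text is too short to be a plausible caption. The `Block` structure and the `min_caption_chars` threshold are assumptions for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str      # "image" or "text"
    top: float     # vertical position of the block on the page (points)
    content: str   # image path for images, raw string for text

def pair_images_with_text(blocks, min_caption_chars=10):
    """Return (image, caption) pairs from one page's detected layout blocks.

    Heuristic: pick the text block closest below each image, and filter out
    pairs whose text is too short to serve as a usable caption.
    """
    images = [b for b in blocks if b.kind == "image"]
    texts = [b for b in blocks if b.kind == "text"]
    pairs = []
    for img in images:
        below = [t for t in texts if t.top > img.top]
        if not below:
            continue  # no text under this image on the page
        caption = min(below, key=lambda t: t.top - img.top)
        if len(caption.content) >= min_caption_chars:
            pairs.append((img.content, caption.content))
    return pairs
```

In a full pipeline, the resulting pairs would then be fed to a model that rewrites them into instruction-tuning examples (question-answer pairs grounded in the image).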

Takeaways, Limitations

Takeaways:
  • Presents PDFs as a new data source for improving the performance of Japanese LMMs.
  • Provides an efficient data-construction method via a fully automated extraction pipeline.
  • Experimentally demonstrates the effectiveness of training Japanese LMMs on PDF-derived data.
  • Analyzes how the benefit of PDF data varies with model size and the underlying language model.
Limitations:
  • Model performance may depend on the quality and diversity of the PDF data.
  • Errors introduced by the automated extraction process require verification and correction.
  • Further evaluation on benchmarks beyond Heron-Bench is needed.
  • The approach depends on the performance of the pre-trained models used for extraction.