Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Towards Visual Text Grounding of Multimodal Large Language Model

Created by
  • Haebom

Authors

Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun

Outline

This paper addresses the limitations of multimodal large language models (MLLMs) in visual text grounding, particularly for document images. Unlike existing grounding benchmarks that focus on natural images, the authors introduce TRIG (Text-Rich Image Grounding), a new task and benchmark centered on the complex layouts and textual content of text-rich document images such as scanned forms and infographics. Using a newly designed instruction dataset, comprising 800 manually annotated question-answer pairs for benchmarking and 90,000 synthetic examples drawn from four diverse datasets, they evaluate and improve the text-rich image grounding capabilities of MLLMs. They further propose two effective TRIG methods: general instruction tuning and a plug-and-play efficient embedding. Fine-tuning MLLMs on the synthetic dataset improves their spatial reasoning and grounding capabilities.
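Grounding performance on a benchmark like TRIG is typically scored by comparing a model's predicted bounding boxes against the annotated ones. The following is a minimal sketch of such an IoU-based check, not the paper's actual evaluation code; the (x1, y1, x2, y2) box format and the 0.5 correctness threshold are common conventions assumed here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical usage: count a prediction as correct when IoU >= 0.5
# (a common convention, not a threshold taken from the paper).
predicted = (120, 40, 300, 90)   # box emitted by the MLLM
annotated = (115, 38, 305, 95)   # ground-truth box from the benchmark
score = iou(predicted, annotated)
print(f"IoU = {score:.3f}, correct = {score >= 0.5}")
```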

Takeaways, Limitations

Takeaways:
We clearly identify the challenges of visual text grounding in text-rich document images and address them through the new TRIG benchmark.
We present a dataset generation method based on an OCR-LLM-human interaction pipeline (sketched after this section).
We demonstrate that the proposed TRIG methods can improve the spatial reasoning and grounding capabilities of MLLMs.
By exposing the limitations of existing MLLMs in document image understanding, we point to directions for future research.
Limitations:
The synthetic dataset (90k examples) may not fully reflect real-world diversity.
The generalization performance of the proposed TRIG methods needs further validation.
The manually annotated benchmark (800 QA pairs) is relatively small.
Generalization to other types of document images is not evaluated.
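The dataset construction mentioned above follows an OCR-LLM-human interaction pattern: an OCR engine extracts text with boxes, an LLM drafts question-answer pairs grounded in those boxes, and humans verify the results. The sketch below is a hypothetical outline of such a pipeline, not the authors' implementation; run_ocr, draft_qa_pairs, and the GroundedQA fields are all stand-ins introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class GroundedQA:
    question: str
    answer: str
    box: tuple[int, int, int, int]  # supporting region, (x1, y1, x2, y2)
    verified: bool = False          # flipped to True after human review

def run_ocr(image_path: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Stub for an OCR engine; a real pipeline would call e.g. Tesseract here."""
    return [("Invoice Total: $120.00", (40, 300, 260, 320))]

def draft_qa_pairs(tokens) -> list[GroundedQA]:
    """Stub for the LLM step that proposes QA pairs grounded in OCR boxes."""
    return [GroundedQA("What is the invoice total?", "$120.00", tokens[0][1])]

def build_examples(image_path: str) -> list[GroundedQA]:
    tokens = run_ocr(image_path)
    candidates = draft_qa_pairs(tokens)
    # Cheap automatic filter: keep pairs whose answer string occurs in the OCR
    # text, leaving finer judgments to the human verification pass.
    return [qa for qa in candidates
            if any(qa.answer in text for text, _ in tokens)]

print(build_examples("form.png"))
```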