Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DOGR: Towards Versatile Visual Document Grounding and Referring

Created by
  • Haebom

Authors

Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, Li Zhu, Ying Shan

Outline

This paper argues that the grounding and referring capabilities of multimodal large language models (MLLMs), which underpin fine-grained understanding and flexible user interaction, remain underdeveloped in visual document understanding. To address this, the authors propose the DOcument Grounding and Referring data engine (DOGR-Engine). DOGR-Engine generates two types of high-quality, fine-grained document data: (1) multi-granular parsing data for improving text localization and recognition, and (2) instruction-tuning data for strengthening the grounding and referring capabilities of MLLMs in dialogue and reasoning. From this data, the authors build DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (charts, posters, and PDF documents). Leveraging the generated data, they also develop DOGR, a strong baseline model that excels at text localization and recognition and accurately grounds and refers to key textual information during dialogue and reasoning. DOGR advances document understanding to a finer granularity and enables flexible interaction paradigms.

Takeaways, Limitations

Takeaways:
The paper presents a novel data engine (DOGR-Engine) and benchmark (DOGR-Bench) that help improve the visual document understanding capabilities of multimodal large language models.
It presents a new baseline model (DOGR) for fine-grained document understanding.
The model combines text localization and recognition with improved grounding and referring capabilities.
It enables a more flexible user-document interaction paradigm.
Limitations:
Further evaluation of the generalization performance of the DOGR-Engine and the DOGR model is needed.
Further validation of scalability across different document types and levels of complexity is required.
The set of tasks currently included in the benchmark may be limited; adding a wider variety of tasks would make the benchmark more comprehensive.