
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DOGR: Towards Versatile Visual Document Grounding and Referring

Created by
  • Haebom

Authors

Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Zhongang Qi, Chen Ma, Li Zhu, Ying Shan

Outline

In this paper, we propose the DOcument Grounding and Referring data engine (DOGR-Engine) to improve the grounding and referring capabilities of multimodal large language models (MLLMs). These capabilities remain underdeveloped in visual document understanding because fine-grained datasets and comprehensive benchmarks are scarce. DOGR-Engine generates two types of high-quality, fine-grained document data: multi-granular parsing data for improving text localization and recognition, and instruction-tuning data for activating the grounding and referring capabilities of MLLMs in conversation and reasoning. Using DOGR-Engine, we construct DOGR-Bench, a benchmark that covers seven grounding and referring tasks across three document types (charts, posters, and PDF documents). Using the generated data, we also develop DOGR, a strong baseline model that excels at text localization and recognition and accurately grounds and refers to key text information during conversation and reasoning. The result is finer-grained document understanding and more flexible interaction paradigms.

Takeaways, Limitations

Takeaways:
DOGR-Bench: a benchmark for fine-grained visual document understanding, covering seven grounding and referring tasks across three document types.
DOGR-Engine: a data engine that generates fine-grained document data for improving the grounding and referring capabilities of MLLMs.
DOGR: a baseline model that performs well at text localization and recognition as well as at grounding and referring.
Together, these point toward finer-grained document understanding and more flexible interaction paradigms.
Limitations:
Further research is needed on how well DOGR-Engine and the DOGR model generalize.
There is a need to expand the diversity of document types and tasks included in DOGR-Bench.
There is a need to evaluate the performance of the DOGR model in real-world applications.