Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

See then Tell: Enhancing Key Information Extraction with Vision Grounding

Created by
  • Haebom

Authors

Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Chenyu Liu

Outline

This paper emphasizes the importance of understanding visually rich documents, which combine text, complex layouts, and images, and points out the limitations of existing Key Information Extraction (KIE) methods: their reliance on OCR introduces latency, high computational cost, and errors. To overcome these limitations, the authors present STNet (See then Tell Net), a novel end-to-end model that extracts text directly from images without OCR. STNet uses special tokens to first observe ("see") the image regions relevant to a question and then, based on those regions, provide an accurate answer together with its visual grounding ("tell"). To improve the model's performance, the authors use GPT-4 to build the TVG (TableQA with Vision Grounding) dataset, which contains text-based question-answering (QA) pairs with accurate visual grounding. Experimental results show state-of-the-art performance on publicly available datasets such as CORD, SROIE, and DocVQA. The authors state that the code will also be made public.
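To make the idea of an answer paired with visual grounding concrete, the following minimal sketch shows one possible way to represent such a result and compare its grounded region with a reference box. Everything here is an assumption for illustration: the GroundedAnswer class, its field names, the axis-aligned box format, and the use of intersection-over-union (IoU) are hypothetical and are not taken from the paper or the STNet code.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedAnswer:
    """Illustrative record: an answer plus the image region supporting it."""
    question: str
    answer: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up example values, not data from TVG or any benchmark.
pred = GroundedAnswer("What is the total?", "12.50", (100.0, 200.0, 180.0, 220.0))
reference_box: Box = (100.0, 200.0, 200.0, 220.0)
overlap = iou(pred.box, reference_box)
```

Keeping the region next to the answer lets a reader or an evaluation script check where in the image an extracted value came from, which is the kind of visual evidence the summary describes.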

Takeaways, Limitations

Takeaways:
Presents a new KIE approach that removes the dependence on OCR.
Improves accuracy by providing visual evidence for answers in image-based question answering.
Builds and releases a high-quality dataset (TVG) using GPT-4.
Achieves state-of-the-art performance on several public datasets.
Improves research reproducibility and extensibility through the planned code release.
Limitations:
The scale of the TVG dataset and how well models trained on it generalize need further validation.
The model's generalization to complex layouts and varied image types still needs to be evaluated.
STNet's computational cost and efficiency need further analysis.