Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Posted by
  • Haebom

Authors

Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur

Outline

Multimodal large language models (LLMs) have become much better at analyzing and understanding complex inputs across multiple modalities, yet long-document processing remains underexplored, largely because suitable benchmarks are lacking. To address this gap, the paper presents Document Haystack, a comprehensive benchmark for evaluating Vision Language Models (VLMs) on long, visually complex documents. The documents range from 5 to 200 pages, and pure-text or multimodal text-and-image "needles" are strategically inserted at various depths within each document to test the retrieval capabilities of VLMs. The benchmark comprises 400 document variants and a total of 8,250 questions, and it supports an objective, automated evaluation framework. The paper details the construction and characteristics of the Document Haystack dataset, presents results from prominent VLMs, and discusses potential research directions in this area.
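To make the needle-in-a-haystack setup concrete, here is a minimal Python sketch of how needles might be spread across a document's depth and how answers could be scored automatically. This is an illustration only, not the authors' implementation: the `Needle` fields, the even-spacing rule in `needle_pages`, and the substring check in `is_correct` are all assumptions introduced here, since the summary does not specify the needle format, the placement rule, or the scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Needle:
    # All fields are hypothetical; the summary does not specify a needle format.
    key: str    # what the question asks about, e.g. "secret fruit"
    value: str  # the expected answer, e.g. "apple"
    page: int   # 1-based page on which the needle is inserted


def needle_pages(num_pages: int, num_needles: int) -> list[int]:
    """Spread needles evenly over the document depth (assumed placement rule).

    Needle i lands at the centre of the i-th of num_needles equal segments.
    Returns distinct 1-based page numbers.
    """
    if not 1 <= num_needles <= num_pages:
        raise ValueError("need 1 <= num_needles <= num_pages")
    return [((2 * i + 1) * num_pages) // (2 * num_needles) + 1
            for i in range(num_needles)]


def is_correct(answer: str, needle: Needle) -> bool:
    """Assumed scoring rule: the expected value appears in the model's answer."""
    return needle.value.lower() in answer.lower()


def accuracy(answers: list[str], needles: list[Needle]) -> float:
    """Fraction of questions whose answer contains the expected value."""
    hits = sum(is_correct(a, n) for a, n in zip(answers, needles))
    return hits / len(needles)


if __name__ == "__main__":
    pages = needle_pages(num_pages=10, num_needles=5)
    needles = [Needle("secret fruit", "apple", pages[0]),
               Needle("secret animal", "zebra", pages[1])]
    model_answers = ["The secret fruit is Apple.", "I could not find it."]
    print(pages, accuracy(model_answers, needles))
```

Because scoring reduces to a string check against a known value, evaluation in a setup like this needs no human or model judge, which is consistent with the objective, automated framing described above.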

Takeaways, Limitations

Takeaways:
Presents Document Haystack, a new benchmark for evaluating VLM performance on long, visually complex documents.
Enables comprehensive evaluation of VLMs' retrieval capabilities across documents of varying length and complexity.
Improves the reproducibility and comparability of research by providing an objective, automated evaluation framework.
Helps guide the direction and development of future VLM research.
Limitations:
The Document Haystack dataset needs further expansion.
Generalization performance still needs to be evaluated across diverse types of visual information and document structures.
Further research is needed on the benchmark's relevance and applicability to real-world scenarios.