Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Created by
  • Haebom

Author

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath

Outline

In this paper, the authors propose DLaVA, a novel training-free pipeline that improves reliability, interpretability, and explainability in document visual question answering (VQA). DLaVA performs zero-shot answer localization by leveraging a multimodal large language model (MLLM). Through an innovative OCR-free approach that annotates text regions with unique bounding box IDs instead of relying on traditional iterative OCR or chain-of-thought reasoning, it significantly reduces computational complexity while preserving spatial context. The evaluation protocol is enhanced by incorporating the Intersection over Union (IoU) metric alongside Average Normalized Levenshtein Similarity (ANLS), assessing spatial accuracy in addition to textual accuracy, which reduces the risk of AI hallucination and improves reliability. Experimental results on benchmark datasets demonstrate performance competitive with state-of-the-art techniques at significantly lower computational cost, improving accuracy and trustworthiness in high-stakes applications. The code and dataset used in DLaVA are available at https://github.com/ahmad-shirazi/AnnotMLLM.
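The evaluation combines ANLS for textual accuracy with IoU for spatial accuracy. A minimal sketch of both metrics is below; the function names and the ANLS threshold `tau=0.5` follow common document-VQA conventions and are illustrative, not the authors' exact implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def anls(prediction, target, tau=0.5):
    """Normalized Levenshtein similarity for one answer pair;
    scores below the threshold tau are zeroed out, as in standard ANLS."""
    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,          # deletion
                               cur[j - 1] + 1,      # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]
    if not prediction and not target:
        return 1.0
    dist = levenshtein(prediction.lower(), target.lower())
    score = 1.0 - dist / max(len(prediction), len(target))
    return score if score >= tau else 0.0
```

A predicted answer then counts as fully correct only if its text clears the ANLS threshold and its predicted bounding box sufficiently overlaps the ground-truth region under IoU.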

Takeaways, Limitations

Takeaways:
The OCR-free approach significantly reduces computational complexity.
Adding the IoU metric brings spatial accuracy into the evaluation, reducing the risk of AI hallucination and increasing reliability.
Leveraging an MLLM enables zero-shot answer localization, reducing reliance on training data.
DLaVA achieves competitive performance compared to state-of-the-art methods.
Limitations:
Further studies are needed on the generalization performance of the proposed method.
Potential performance degradation on specific document or question types remains to be analyzed.
A comparative performance analysis across different MLLMs is lacking.
Additional performance validation in real-world application environments is required.