DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness
Created by
Haebom
Author
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath
Outline
In this paper, we propose DLaVA, a training-free pipeline that improves reliability, interpretability, and explainability in document visual question answering (Document VQA). DLaVA leverages a multimodal large language model (MLLM) to perform zero-shot answer localization. Through an innovative OCR-free approach that annotates text regions with unique bounding box IDs, rather than relying on traditional iterative OCR or chain-of-thought reasoning, it preserves spatial context while significantly reducing computational complexity. The evaluation protocol is strengthened by incorporating the Intersection over Union (IoU) metric alongside ANLS, so that spatial accuracy is assessed in addition to textual accuracy, reducing the risk of AI hallucination and improving trustworthiness. Experimental results on benchmark datasets show performance competitive with state-of-the-art methods at markedly lower computational cost, improving accuracy and reliability in high-stakes applications. The code and datasets used in DLaVA are available at https://github.com/ahmad-shirazi/AnnotMLLM.
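To make the evaluation protocol concrete, below is a minimal sketch (an illustration, not code from the DLaVA repository) of the two metrics the summary pairs together: ANLS for textual accuracy of the answer string and IoU for spatial accuracy of the localized answer box. The (x1, y1, x2, y2) pixel box format and the 0.5 ANLS threshold are assumptions based on common Document VQA practice.

```python
# Sketch: scoring a predicted answer on both textual accuracy (ANLS)
# and spatial accuracy (IoU). Boxes are assumed to be (x1, y1, x2, y2).

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity for one answer pair.
    Scores below the threshold tau are clipped to 0, per the standard metric."""
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold))
    sim = 1.0 - nl
    return sim if sim >= tau else 0.0

def iou(box_a, box_b) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction is trustworthy only if both text and location match:
print(anls("invoice total", "Invoice Total"))   # 1.0 (case-insensitive match)
print(iou((10, 10, 50, 30), (12, 10, 52, 30)))  # ~0.90 (slightly shifted box)
```

Reporting the two scores side by side captures the point made above: an answer that is textually correct but localized in the wrong region of the document is still penalized, which is how spatial grounding helps flag hallucinated answers.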