Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries here are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Multimodal Iterative RAG for Knowledge Visual Question Answering

Created by
  • Haebom

Authors

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Outline

This paper proposes MI-RAG, a multimodal iterative retrieval-augmented generation framework, to address the limited performance of multimodal large language models (MLLMs) on knowledge-intensive visual questions that require external knowledge. MI-RAG uses reasoning to improve retrieval and updates its reasoning across modalities as new knowledge is retrieved. At each iteration, it dynamically generates multiple queries from the accumulated reasoning record and runs a joint search across heterogeneous knowledge bases containing both visually grounded and textual knowledge. The newly retrieved knowledge is then integrated into the reasoning record, so the model's understanding improves with each iteration. On benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG significantly improves retrieval recall and answer accuracy, offering a scalable approach to compositional reasoning in knowledge-intensive VQA.
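The loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`iterative_rag`, `ReasoningRecord`, `make_queries`, `synthesize`, `answer`, `toy_kb`) is hypothetical, the MLLM components are replaced by trivial stand-in functions, the knowledge bases are toy keyword lookups, and the stopping rule (stop when nothing new is retrieved, or after `max_iters`) is an assumption rather than something stated in the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a retriever maps a query string to a list of passages.
Retriever = Callable[[str], List[str]]


@dataclass
class ReasoningRecord:
    """Accumulated reasoning: the question plus one note per iteration."""
    question: str
    notes: List[str] = field(default_factory=list)


def iterative_rag(
    question: str,
    knowledge_bases: Dict[str, Retriever],
    make_queries: Callable[[ReasoningRecord], List[str]],
    synthesize: Callable[[ReasoningRecord, List[str]], str],
    answer: Callable[[ReasoningRecord], str],
    max_iters: int = 3,  # assumed iteration budget
) -> Tuple[str, ReasoningRecord]:
    record = ReasoningRecord(question)
    seen = set()
    for _ in range(max_iters):
        # 1. Generate several queries from the accumulated record.
        queries = make_queries(record)
        # 2. Joint search: send every query to every knowledge base.
        new_passages = []
        for query in queries:
            for search in knowledge_bases.values():
                for passage in search(query):
                    if passage not in seen:
                        seen.add(passage)
                        new_passages.append(passage)
        # Assumed stopping rule: nothing new was retrieved.
        if not new_passages:
            break
        # 3. Fold the new knowledge back into the record.
        record.notes.append(synthesize(record, new_passages))
    return answer(record), record


def toy_kb(entries: Dict[str, str]) -> Retriever:
    """Toy keyword lookup standing in for a real retriever."""
    def search(query: str) -> List[str]:
        q = query.lower()
        return [text for key, text in entries.items() if key in q]
    return search


# Toy stand-ins for the model-driven steps (a real system would call an MLLM).
def make_queries(record: ReasoningRecord) -> List[str]:
    return [record.question] + [f"{record.question} {n}" for n in record.notes]


def synthesize(record: ReasoningRecord, passages: List[str]) -> str:
    return " ".join(passages)


def answer(record: ReasoningRecord) -> str:
    return record.notes[-1] if record.notes else "unknown"


question = "In which city is the landmark in the image?"
knowledge_bases = {
    "visual": toy_kb({"landmark": "The landmark in the image is the Eiffel Tower."}),
    "text": toy_kb({"eiffel tower": "The Eiffel Tower is located in Paris."}),
}

final, record = iterative_rag(question, knowledge_bases, make_queries, synthesize, answer)
print(final)  # prints: The Eiffel Tower is located in Paris.
```

In this toy setup the textual fact only becomes reachable after the first iteration has identified the landmark, so calling the function with `max_iters=1` returns just the visual passage. That mirrors the motivation for iterating instead of retrieving once.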

Takeaways, Limitations

Takeaways:
  • Improves the performance of MLLMs on knowledge-intensive visual question answering.
  • Integrating knowledge through iterative retrieval and reasoning yields more accurate and comprehensive answers.
  • Presents an extensible framework that effectively draws on knowledge from multiple modalities.
  • Reports experimentally verified gains on benchmarks such as Encyclopedic VQA, InfoSeek, and OK-VQA.
Limitations:
  • The computational cost and processing time of the MI-RAG framework are not analyzed.
  • Generalization to other types of knowledge bases still needs to be evaluated.
  • Error propagation across iterations and the transparency of the reasoning process need further study.
  • Results may be biased toward certain types of knowledge bases.