Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

Created by
  • Haebom

Authors

Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang

Overview

This paper proposes a retrieval-augmented generation (RAG) approach to knowledge-based visual question answering (KB-VQA). State-of-the-art multimodal large language models (MLLMs) have difficulty accessing domain-specific or up-to-date knowledge in visual question answering (VQA) tasks, and KB-VQA addresses this by drawing on external knowledge bases (KBs). Because existing single-modal retrieval techniques lose image information, the authors propose the knowledge unit retrieval-augmented generation (KU-RAG) framework, which builds structured, fine-grained knowledge units from multimodal data fragments such as text fragments and object images and integrates them with an MLLM. KU-RAG is designed to retrieve relevant knowledge accurately and to strengthen reasoning through a knowledge modification chain. In experiments, the proposed method outperforms existing KB-VQA methods by 3% on average and by up to 11% on four benchmarks.
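The idea of retrieving over knowledge units that pair a text fragment with an object image can be illustrated with a minimal sketch. This is not the authors' implementation: the `KnowledgeUnit` class, the `retrieve` and `build_prompt` helpers, the `image_weight` parameter, and the hand-written three-dimensional vectors are all hypothetical stand-ins for real text and image encoders and a vector database. It only shows why a text-only score can pick the wrong entity when the question is ambiguous, while a score that also uses the image does not.

```python
import math
from dataclasses import dataclass


@dataclass
class KnowledgeUnit:
    """Hypothetical knowledge unit: one text fragment paired with one object image."""
    name: str
    text: str
    text_vec: list[float]   # stand-in for a text-encoder embedding
    image_vec: list[float]  # stand-in for an image-encoder embedding


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(units, query_text_vec, query_image_vec, k=1, image_weight=0.5):
    """Return the k units with the highest weighted text + image similarity.

    The linear weighting is an assumption made for this sketch only.
    """
    def score(unit):
        text_sim = cosine(query_text_vec, unit.text_vec)
        image_sim = cosine(query_image_vec, unit.image_vec)
        return (1.0 - image_weight) * text_sim + image_weight * image_sim

    return sorted(units, key=score, reverse=True)[:k]


def build_prompt(question, units):
    """Format the retrieved units as context for a (hypothetical) MLLM call."""
    facts = "\n".join(f"- {u.name}: {u.text}" for u in units)
    return f"Known facts:\n{facts}\n\nQuestion: {question}"


# Toy data: the vectors are invented for illustration, not produced by a model.
UNITS = [
    KnowledgeUnit("Eiffel Tower", "Completed in 1889.",
                  text_vec=[1.0, 0.0, 0.0], image_vec=[0.9, 0.1, 0.0]),
    KnowledgeUnit("Tokyo Tower", "Completed in 1958.",
                  text_vec=[0.9, 0.1, 0.0], image_vec=[0.0, 1.0, 0.0]),
]

# The question text is ambiguous ("this tower"); the image resembles Tokyo Tower.
QUERY_TEXT_VEC = [1.0, 0.0, 0.0]
QUERY_IMAGE_VEC = [0.1, 0.9, 0.0]

text_only = retrieve(UNITS, QUERY_TEXT_VEC, QUERY_IMAGE_VEC, image_weight=0.0)
multimodal = retrieve(UNITS, QUERY_TEXT_VEC, QUERY_IMAGE_VEC, image_weight=0.5)

print(text_only[0].name)
print(multimodal[0].name)
print(build_prompt("When was this tower completed?", multimodal))
```

With this toy data, text-only scoring selects the Eiffel Tower unit, while the weighted text-and-image score selects the Tokyo Tower unit that matches the picture. A real system would replace the hand-written vectors with encoder outputs and the linear scan with a vector index.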

Takeaways, Limitations

Takeaways:
Presents an effective method for retrieving and using knowledge by structuring and managing fine-grained knowledge units.
Improves VQA performance and strengthens the reasoning capability of MLLMs through the KU-RAG framework.
Demonstrates better performance than existing methods across several benchmarks.
Limitations:
Further research is needed on the scalability and generalization performance of the proposed framework.
Generalization performance may degrade when the knowledge base is biased toward a specific domain.
The knowledge modification chain may increase complexity and computational cost.