Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

Created by
  • Haebom

Authors

Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, Jiang Bian

Outline

This paper presents an improved vision-language retrieval-augmented generation (RAG) approach for knowledge-based visual question answering (KB-VQA). Existing vision-language RAG systems struggle to retrieve effectively because knowledge comes in diverse modalities and granularities. To address this, the authors propose a multimodal RAG system that proceeds in multiple stages, from coarse to fine. First, a broad multimodal retrieval tailored to the knowledge granularity is performed. Next, multimodal information is used to select the top entities. Finally, text reranking selects the most relevant fine-grained information for generation. The system achieves state-of-the-art retrieval performance and competitive answering results on the InfoSeek and Encyclopedic-VQA benchmarks.
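The coarse-to-fine flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the names `Entity`, `coarse_retrieve`, `fusion_rerank`, `select_section`, and `retrieve` are hypothetical, the embeddings are hand-written two-dimensional toy vectors, and the fusion step is a plain weighted sum of cosine similarities controlled by an assumed weight `alpha`.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List


@dataclass
class Entity:
    # All fields are hypothetical stand-ins for real encoder outputs.
    name: str
    image_vec: List[float]             # toy image embedding of the entity
    summary_vec: List[float]           # toy text embedding of the entity summary
    sections: Dict[str, List[float]]   # section title -> toy text embedding


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_retrieve(query_image_vec, entities, k):
    """Stage 1 (coarse): entity-level search using image similarity only."""
    ranked = sorted(entities, key=lambda e: cosine(query_image_vec, e.image_vec), reverse=True)
    return ranked[:k]


def fusion_rerank(query_image_vec, question_vec, candidates, alpha=0.5):
    """Stage 2: rerank candidates by a weighted sum of image and text similarity.

    The weighted sum is an assumed, simplified fusion rule.
    """
    def fused(e):
        image_score = cosine(query_image_vec, e.image_vec)
        text_score = cosine(question_vec, e.summary_vec)
        return alpha * image_score + (1 - alpha) * text_score
    return sorted(candidates, key=fused, reverse=True)


def select_section(question_vec, entity):
    """Stage 3 (fine): pick the section whose text embedding best matches the question."""
    return max(entity.sections, key=lambda title: cosine(question_vec, entity.sections[title]))


def retrieve(query_image_vec, question_vec, entities, k=2, alpha=0.5):
    candidates = coarse_retrieve(query_image_vec, entities, k)
    top_entity = fusion_rerank(query_image_vec, question_vec, candidates, alpha)[0]
    return top_entity.name, select_section(question_vec, top_entity)


# Made-up toy knowledge base: three entities, each with two sections.
SECTIONS = {"history": [1.0, 0.0], "height": [0.0, 1.0]}
ENTITIES = [
    Entity("tower_a", [1.0, 0.0], [0.0, 1.0], dict(SECTIONS)),
    Entity("tower_b", [0.9, 0.1], [1.0, 0.0], dict(SECTIONS)),
    Entity("clock_c", [0.0, 1.0], [0.0, 1.0], dict(SECTIONS)),
]

if __name__ == "__main__":
    # The query image looks most like tower_b, but the question text matches tower_a.
    print(retrieve([0.9, 0.1], [0.0, 1.0], ENTITIES))
```

In this toy setup the image-only first stage ranks `tower_b` first, while the fused second stage promotes `tower_a` because its summary matches the question. The third stage then returns the `height` section as the passage to hand to a generator. A real system would replace the toy vectors with learned encoders and a trained reranker.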

Takeaways, Limitations

Takeaways:
A multimodal RAG system makes retrieval for the KB-VQA problem more efficient.
A multi-stage, coarse-to-fine search makes effective use of different levels of knowledge granularity.
Multimodal information fusion and reranking enable more accurate information retrieval and question answering.
The system achieves state-of-the-art retrieval performance and competitive answering results on the InfoSeek and Encyclopedic-VQA benchmarks.
Limitations:
Further research is needed to investigate the generality of the proposed method and its applicability to other KB-VQA datasets.
The complexity of the multimodal information fusion process may increase computational costs.
Hyperparameter settings that are optimized for a particular dataset may result in poor performance on other datasets.