Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation

Created by
  • Haebom

Author

Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu

Outline

This paper systematically analyzes the retrieval-augmented generation (RAG) pipeline for improving the performance of large vision-language models (LVLMs). LVLMs suffer from limitations such as static training data, hallucination, and the inability to verify claims against up-to-date external evidence; RAG mitigates these issues by giving the model access to an external knowledge base. The paper examines each stage in turn: retrieval (modality configuration and retrieval strategy), reranking (mitigating positional bias and surfacing the most relevant evidence), and generation (how to integrate the retrieved candidates into the final output). It then proposes a self-reflective agent framework that unifies reranking and generation, achieving an average performance improvement of 5% without any fine-tuning.
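The three-stage pipeline described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: keyword-overlap scoring stands in for a real multimodal retriever, and the "reflection" step is a simple relevance check that widens retrieval and retries when evidence looks insufficient. All function names here are invented for illustration.

```python
# Toy sketch of a retrieve -> rerank -> generate pipeline with a
# self-reflection loop, mirroring the stages analyzed in the paper.
# Scoring, reranking, and generation are simplistic stand-ins.

from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float = 0.0


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[Candidate]:
    """Stage 1: score each document by keyword overlap with the query."""
    q_terms = set(query.lower().split())
    cands = [
        Candidate(doc, len(q_terms & set(doc.lower().split())))
        for doc in corpus
    ]
    return sorted(cands, key=lambda c: c.score, reverse=True)[:k]


def rerank(query: str, cands: list[Candidate]) -> list[Candidate]:
    """Stage 2: re-order candidates; ties broken by document length
    (a crude stand-in for positional-bias mitigation)."""
    return sorted(cands, key=lambda c: (-c.score, len(c.text)))


def generate(query: str, evidence: list[Candidate]) -> str:
    """Stage 3: toy generator that grounds the answer in the evidence."""
    return f"Answer to '{query}' based on: " + " | ".join(
        c.text for c in evidence
    )


def self_reflective_rag(query: str, corpus: list[str],
                        max_rounds: int = 2) -> str:
    """Agent loop: generate, then reflect; if no candidate looks
    relevant, widen retrieval and try again."""
    k = 2
    answer = ""
    for _ in range(max_rounds):
        cands = rerank(query, retrieve(query, corpus, k=k))
        answer = generate(query, cands)
        if cands and cands[0].score > 0:  # reflection: any relevant evidence?
            return answer
        k += 2  # widen retrieval before the next round
    return answer
```

In the actual paper the retriever handles multiple modalities and the reranker/generator are LVLM-based, but the control flow (retrieve, rerank, generate, reflect) follows this shape.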

Takeaways, Limitations

Takeaways:
We present the first systematic analysis of the RAG pipeline for LVLMs.
We identify effective strategies for each stage: retrieval, reranking, and generation.
A self-reflection-based agent framework that integrates reranking and generation drives the performance gains.
Significant performance improvements (5% on average) are achieved without fine-tuning.
Limitations:
Results are reported for specific LVLMs and datasets, so further research is needed to establish their generalizability.
The scalability of the proposed agent framework and its potential for various application areas need to be evaluated.
Quantitative analysis of how much hallucination is actually mitigated is lacking.
The impact of the quality and size of the external knowledge database on performance is not analyzed in detail.