This paper systematically analyzes the Retrieval-Augmented Generation (RAG) pipeline for improving the performance of Large Vision-Language Models (LVLMs). LVLMs suffer from limitations such as static training data, hallucination, and the inability to verify claims against up-to-date external evidence. RAG mitigates these issues by retrieving from an external knowledge base. We examine each phase individually: the retrieval phase (modality configuration and retrieval strategies), the reranking phase (mitigating positional bias and promoting relevant evidence), and the generation phase (how to integrate retrieved candidates into the prompt). We then propose a self-reflective agent framework that integrates reranking and generation, achieving an average performance improvement of 5% without fine-tuning.
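To make the three-phase structure concrete, the sketch below shows a minimal retrieve-rerank-generate pipeline. All function names and scoring heuristics here are illustrative assumptions, not the paper's actual components: the retriever is a toy word-overlap scorer standing in for a multimodal retriever, and `rerank` and `generate` are placeholders for a cross-encoder reranker and the LVLM, respectively.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[Candidate]:
    # Toy lexical retriever: score each document by word overlap with the
    # query (stands in for a dense multimodal retriever).
    q = set(query.lower().split())
    scored = [Candidate(doc, len(q & set(doc.lower().split()))) for doc in corpus]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]


def rerank(query: str, candidates: list[Candidate]) -> list[Candidate]:
    # Placeholder for a reranking model: re-sort by score, breaking ties
    # toward shorter (more focused) evidence, to illustrate promoting
    # relevant evidence regardless of original retrieval position.
    return sorted(candidates, key=lambda c: (-c.score, len(c.text)))


def generate(query: str, evidence: list[Candidate]) -> str:
    # Placeholder for the LVLM generation step: in practice the top-ranked
    # evidence is injected into the model's prompt/context.
    context = " | ".join(c.text for c in evidence)
    return f"Answer({query}) given [{context}]"


corpus = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "RAG retrieves external evidence.",
]
query = "Where is the Eiffel Tower?"
out = generate(query, rerank(query, retrieve(query, corpus, k=2)))
print(out)
```

A self-reflective agent variant would loop this pipeline, letting the generator critique its own output and trigger re-retrieval or re-ranking when the evidence looks insufficient.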