This paper aims to improve knowledge-intensive visual question answering (VQA) with multimodal large language models (MLLMs). To overcome the limitations of conventional single-pass retrieval-augmented generation (RAG) methods, we propose a Multimodal Iterative RAG framework (MI-RAG), which leverages reasoning to enhance retrieval and incorporates knowledge synthesis. At each iteration, MI-RAG generates multiple queries, retrieves diverse knowledge, and synthesizes it to deepen its understanding of the question. Experiments on the Encyclopedic VQA, InfoSeek, and OK-VQA benchmarks demonstrate that MI-RAG significantly improves both retrieval and answer accuracy.
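As a rough illustration of the iterative loop sketched above, the following Python snippet shows one way a multi-query retrieve-and-synthesize cycle could be structured. The interfaces (`generate_queries`, `synthesize`, `is_answerable`, `retriever.search`) and the stopping criterion are hypothetical placeholders for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a multimodal iterative RAG loop (MI-RAG-style).
# All function names and interfaces below are illustrative assumptions,
# not the framework's actual implementation.

from dataclasses import dataclass, field


@dataclass
class RAGState:
    """Question, image, and knowledge accumulated across iterations."""
    question: str
    image: bytes
    knowledge: list[str] = field(default_factory=list)
    reasoning: str = ""


def mi_rag_answer(mllm, retriever, question: str, image: bytes,
                  max_iters: int = 3, queries_per_iter: int = 3) -> str:
    """Iteratively generate queries, retrieve, and synthesize before answering."""
    state = RAGState(question=question, image=image)

    for _ in range(max_iters):
        # 1. Reason over the question, image, and knowledge gathered so far
        #    to propose several complementary search queries.
        queries = mllm.generate_queries(state, n=queries_per_iter)

        # 2. Retrieve passages for each query and deduplicate while
        #    preserving order, so diverse evidence accumulates over rounds.
        retrieved = []
        for q in queries:
            retrieved.extend(retriever.search(q, top_k=5))
        state.knowledge = list(dict.fromkeys(state.knowledge + retrieved))

        # 3. Synthesize the retrieved knowledge into an updated reasoning
        #    trace that conditions the next round of query generation.
        state.reasoning = mllm.synthesize(state)

        # 4. Stop early once the model judges the question answerable
        #    (an assumed convergence check, used here for illustration).
        if mllm.is_answerable(state):
            break

    # Produce the final answer grounded in the accumulated knowledge.
    return mllm.answer(state)
```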