This paper proposes a multimodal iterative retrieval-augmented generation (MI-RAG) framework to address the limitations of multimodal large language models (MLLMs) on knowledge-intensive visual questions that require external knowledge. MI-RAG uses reasoning to enhance retrieval and updates its reasoning across modalities as new knowledge is discovered. At each iteration, it dynamically formulates multiple queries from the accumulated reasoning record and performs a joint search over heterogeneous knowledge bases that contain both visually grounded and textual knowledge. Newly acquired knowledge is integrated back into the reasoning record, iteratively refining the framework's understanding of the question. On benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG significantly improves both retrieval recall and answer accuracy, offering a scalable approach to compositional reasoning in knowledge-intensive VQA.
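
To make the iterative loop concrete, the following is a minimal sketch of the retrieve-then-update cycle described above. It is not the authors' implementation: the function names (`generate_queries`, `search_knowledge_bases`, `mi_rag_answer`), the `ReasoningState` structure, and the dictionary-based knowledge bases are all hypothetical placeholders standing in for the MLLM-driven query drafting, joint heterogeneous retrieval, and reasoning-record update steps.

```python
# Illustrative sketch of an MI-RAG-style loop. All names and data structures
# are assumptions for exposition, not the paper's API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReasoningState:
    """Accumulated reasoning record plus retrieved knowledge for one question."""
    question: str
    image_ref: str
    history: List[str] = field(default_factory=list)
    knowledge: List[str] = field(default_factory=list)


def generate_queries(state: ReasoningState, num_queries: int = 3) -> List[str]:
    """Placeholder: an MLLM would draft multiple queries from the reasoning record."""
    context = " ".join(state.history[-3:]) or state.question
    return [f"{state.question} | focus {i}: {context}" for i in range(num_queries)]


def search_knowledge_bases(queries: List[str],
                           visual_kb: Dict[str, str],
                           text_kb: Dict[str, str]) -> List[str]:
    """Placeholder joint search over visually grounded and textual knowledge bases."""
    hits: List[str] = []
    for q in queries:
        hits += [doc for key, doc in visual_kb.items() if key in q]
        hits += [doc for key, doc in text_kb.items() if key in q]
    return hits


def mi_rag_answer(question: str, image_ref: str,
                  visual_kb: Dict[str, str], text_kb: Dict[str, str],
                  max_iters: int = 3) -> str:
    """Run a few retrieve-and-update iterations, then synthesize an answer."""
    state = ReasoningState(question=question, image_ref=image_ref)
    for _ in range(max_iters):
        queries = generate_queries(state)                        # query drafting from the record
        new_knowledge = search_knowledge_bases(queries, visual_kb, text_kb)
        if not new_knowledge:
            break                                                # no new evidence; stop early
        state.knowledge += new_knowledge                         # integrate retrieved facts
        state.history.append(f"retrieved: {new_knowledge[:2]}")  # update the reasoning record
    # In MI-RAG the final answer would be generated by the MLLM over the
    # accumulated knowledge; here we only report what was gathered.
    return f"answer grounded in {len(state.knowledge)} retrieved passages"
```

In this sketch the reasoning record plays the role the abstract assigns to the accumulated inference history: each iteration's queries are conditioned on what was retrieved before, so later searches can target knowledge the first pass missed.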