In this paper, we propose a KB-VQA approach based on retrieval-augmented generation (RAG) that leverages external knowledge bases (KBs) to address a key limitation of state-of-the-art multimodal large language models (MLLMs): their difficulty in accessing domain-specific or up-to-date knowledge in visual question answering (VQA) tasks. To mitigate the loss of image information in existing single-modality retrieval techniques, we propose a knowledge unit retrieval-augmented generation (KU-RAG) framework that organizes multimodal data fragments, such as text snippets and object images, into fine-grained knowledge units and integrates them with MLLMs. KU-RAG ensures accurate retrieval of relevant knowledge and strengthens reasoning through a knowledge correction chain. Experimental results show that the proposed method outperforms existing KB-VQA methods by an average of 3% and by up to 11% across four benchmarks.
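
To make the knowledge-unit idea concrete, the following is a minimal sketch, not the authors' implementation, of how fine-grained knowledge units pairing text fragments with object-image embeddings might be stored and retrieved before being handed to an MLLM. All names (KnowledgeUnit, retrieve_units), the use of a shared multimodal embedding space, and the additive similarity fusion are assumptions for illustration only.

```python
# Hypothetical sketch of knowledge-unit retrieval (not the paper's code).
# Assumes a multimodal encoder (e.g., a CLIP-style model) that maps text
# fragments and object images into a shared, L2-normalized embedding space.
from dataclasses import dataclass
import numpy as np

@dataclass
class KnowledgeUnit:
    """A fine-grained unit pairing a text fragment with an object image."""
    text_fragment: str
    text_embedding: np.ndarray    # embedding of the text fragment (normalized)
    image_embedding: np.ndarray   # embedding of the associated object image (normalized)

def retrieve_units(query_text_emb: np.ndarray,
                   query_image_emb: np.ndarray,
                   kb: list[KnowledgeUnit],
                   top_k: int = 5) -> list[KnowledgeUnit]:
    """Score each unit by combined text + image similarity and return the top-k."""
    def score(unit: KnowledgeUnit) -> float:
        txt_sim = float(query_text_emb @ unit.text_embedding)   # cosine similarity
        img_sim = float(query_image_emb @ unit.image_embedding)  # cosine similarity
        return txt_sim + img_sim  # simple additive fusion; the weighting is an assumption
    return sorted(kb, key=score, reverse=True)[:top_k]
```

In such a pipeline, the retrieved units (text plus references to their object images) would be serialized into the MLLM prompt, and a subsequent correction step could verify or refine the retrieved knowledge before the final answer is generated.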