This paper proposes MMGraphRAG to address two shortcomings of existing Retrieval-Augmented Generation (RAG) methods: insufficient use of multimodal information and the neglect of logical relationships between knowledge structures across modalities. MMGraphRAG represents visual content as scene graphs and fuses them with a text-based knowledge graph to build a multimodal knowledge graph (MMKG). It performs cross-modal entity linking via spectral clustering and guides generation by retrieving context along inference paths. The method achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear inference paths.
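To make the cross-modal entity-linking step concrete, the sketch below shows one plausible way spectral clustering could be applied: embeddings of entities extracted from text and from image scene graphs are clustered jointly, and pairs from different modalities that fall into the same cluster become candidate links. This is an illustrative assumption, not the authors' implementation; the function name, embeddings, and cluster count are hypothetical.

```python
# Minimal sketch (not the paper's code) of cross-modal entity linking
# with spectral clustering over a precomputed cosine-similarity affinity.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def link_cross_modal_entities(text_embs, image_embs, n_clusters=4):
    """Return candidate (text_idx, image_idx) pairs sharing a cluster."""
    embs = np.vstack([text_embs, image_embs])
    # Affinity matrix must be non-negative for spectral clustering.
    affinity = np.clip(cosine_similarity(embs), 0.0, 1.0)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    n_text = len(text_embs)
    text_labels, image_labels = labels[:n_text], labels[n_text:]
    # Entities from different modalities in the same cluster are link candidates.
    return [
        (i, j)
        for i, t in enumerate(text_labels)
        for j, v in enumerate(image_labels)
        if t == v
    ]

# Toy usage: random vectors stand in for real entity embeddings.
rng = np.random.default_rng(0)
print(link_cross_modal_entities(rng.normal(size=(6, 32)), rng.normal(size=(5, 32))))
```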