Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Created by
  • Haebom

Authors

Xueyao Wan, Hang Yu

Outline

This paper proposes MMGraphRAG to address two shortcomings of existing Retrieval-Augmented Generation (RAG) methods: limited use of multimodal information and insufficient modeling of the logical relationships between knowledge structures across modalities. MMGraphRAG represents visual content with scene graphs and combines them with a text-based knowledge graph to build a multimodal knowledge graph (MMKG). It performs cross-modal entity linking via spectral clustering and guides generation by retrieving context along reasoning paths. The method achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear, interpretable reasoning paths.
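The paper itself provides no code here; the sketch below is only a rough illustration of what the cross-modal entity linking step could look like, clustering text-graph and scene-graph entity embeddings with spectral clustering and treating entities in the same cluster as link candidates. The function name (link_entities), the use of scikit-learn, and the random stand-in embeddings are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal entity linking via spectral clustering.
# Assumes entity embeddings from some text/image encoder are already available;
# none of these names come from the MMGraphRAG paper.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def link_entities(text_embs: np.ndarray, image_embs: np.ndarray, n_clusters: int):
    """Group text-graph and scene-graph entities into shared clusters."""
    embs = np.vstack([text_embs, image_embs])      # (n_text + n_img, d)
    affinity = cosine_similarity(embs)             # pairwise similarity matrix
    affinity = np.clip(affinity, 0.0, None)        # keep affinities non-negative
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    # Text/image entity pairs that share a cluster label are candidates for a
    # cross-modal link (merged or connected in the multimodal knowledge graph).
    n_text = len(text_embs)
    links = [
        (i, j - n_text)
        for i in range(n_text)
        for j in range(n_text, len(embs))
        if labels[i] == labels[j]
    ]
    return labels, links

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
labels, links = link_entities(rng.normal(size=(6, 32)), rng.normal(size=(4, 32)), n_clusters=3)
print(labels, links[:3])
```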

Takeaways, Limitations

Takeaways:
  • Shows that effectively leveraging multimodal information improves RAG performance.
  • Generates more accurate and richer answers by modeling the logical connections between visual and textual information.
  • Presents a novel RAG framework built on scene graphs and multimodal knowledge graphs.
  • Achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
  • Provides strong domain adaptability and clear reasoning paths.
Limitations:
  • Further validation is needed to confirm whether the framework fully overcomes the limitation of existing RAG methods that require large-scale training for specific tasks.
  • The computational cost and efficiency of MMKG construction and spectral clustering are not analyzed.
  • Generalization to a wider variety of multimodal data types remains to be evaluated.