Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Created by
  • Haebom

Authors

Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Outline

This paper proposes ArtRAG, a novel framework for interpreting artworks from multiple perspectives (cultural, historical, and stylistic). Existing multimodal large language models (MLLMs) fail to adequately capture the nuances of art interpretation; to address this, ArtRAG uses an Art Contextual Knowledge Graph (ACKG) automatically generated from domain-specific text sources. The ACKG organizes entities such as artists, movements, subjects, and historical events into an interpretable graph. A multi-granular structured retriever selects relevant subgraphs, which then guide the MLLM's generation. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms existing models, and human evaluations indicate that it generates consistent, insightful, and culturally rich interpretations.
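The retrieve-then-generate flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the function names (`retrieve_subgraph`, `format_context`, `build_prompt`), and the use of a plain k-hop breadth-first expansion as the retrieval step are all assumptions made for illustration. The paper's retriever is described only as multi-granular and structured, so the simple neighborhood expansion here is a stand-in.

```python
from collections import deque

# Toy knowledge graph as (head, relation, tail) triples.
# Hypothetical data for illustration; not taken from the paper's ACKG.
TRIPLES = [
    ("Claude Monet", "member_of", "Impressionism"),
    ("Impressionism", "emerged_in", "France, 1870s"),
    ("Impressionism", "emphasizes", "light and atmosphere"),
    ("Claude Monet", "painted", "Impression, Sunrise"),
    ("Pablo Picasso", "co_founded", "Cubism"),
]


def build_adjacency(triples):
    """Index every triple under both of its endpoints (undirected view)."""
    adjacency = {}
    for triple in triples:
        head, _, tail = triple
        adjacency.setdefault(head, []).append((tail, triple))
        adjacency.setdefault(tail, []).append((head, triple))
    return adjacency


def retrieve_subgraph(triples, seeds, max_hops=1):
    """Collect triples within max_hops of the seed entities (BFS stand-in
    for the paper's structured retriever)."""
    adjacency = build_adjacency(triples)
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    selected = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, triple in adjacency.get(node, []):
            if triple not in selected:
                selected.append(triple)
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return selected


def format_context(triples):
    """Serialize triples as one line each, suitable for a text prompt."""
    return "\n".join(f"{h} --{r}--> {t}" for h, r, t in triples)


def build_prompt(question, triples):
    """Prepend the retrieved graph context to the question for the MLLM."""
    return f"Context:\n{format_context(triples)}\n\nQuestion: {question}"


if __name__ == "__main__":
    subgraph = retrieve_subgraph(TRIPLES, ["Claude Monet"], max_hops=2)
    print(build_prompt("Explain the style of this painting.", subgraph))
```

In this sketch the seed entity is supplied by hand; in a real pipeline it would have to come from the image or its metadata, and the resulting prompt would be passed to an MLLM together with the artwork image.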

Takeaways, Limitations

Takeaways:
Enables interpretation of artworks from multiple perspectives by using a domain-specific knowledge graph.
Overcomes the limitations of existing MLLMs and generates richer, more accurate descriptions of artworks.
Presents a novel, training-free approach that combines knowledge graphs with RAG.
Outperforms existing models on the SemArt and Artpedia datasets.
Limitations:
Performance may depend on the quality and quantity of the domain-specific text sources used to build the ACKG.
The quality of descriptions for a particular art movement or style may be affected by biases in the dataset.
The framework may rely on textual information rather than directly exploiting the visual content of the artwork.
The subjectivity of human evaluations may influence the results.