Daily Arxiv

This page collects and organizes artificial intelligence papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; when sharing, simply cite the source.

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Created by
  • Haebom

Authors

Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma

KGQAGen: LLM-in-the-loop Framework for Knowledge Graph Question Answering

Outline

This paper identifies quality issues in benchmarks used to evaluate Knowledge Graph Question Answering (KGQA) systems and proposes KGQAGen, an LLM-in-the-loop framework that addresses them. Motivated by the low accuracy (57%) of existing KGQA benchmarks, KGQAGen combines a structured knowledge base, LLM-guided generation, and symbolic verification to produce challenging yet verifiable QA instances. Using KGQAGen, the authors build KGQAGen-10k, a 10,000-instance benchmark grounded in Wikidata, and evaluate a range of KG-RAG models on it. The experiments show that even state-of-the-art systems struggle on this benchmark, exposing the limitations of current models.
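To make the generate-then-verify loop concrete, below is a minimal Python sketch of how an LLM-in-the-loop pipeline with symbolic verification against Wikidata could look. The `llm_propose_qa` helper is hypothetical (the summary does not specify KGQAGen's prompts or model client), and the acceptance criterion is an assumption; only the public Wikidata SPARQL endpoint is a real service. The actual KGQAGen implementation may differ.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"  # public Wikidata endpoint

def run_sparql(query: str) -> set[str]:
    """Execute a SPARQL query against Wikidata and return the set of
    bound values in the first result column."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kgqagen-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    var = data["head"]["vars"][0]
    return {b[var]["value"] for b in data["results"]["bindings"]}

def llm_propose_qa(seed_entity: str) -> dict:
    """Hypothetical LLM step: given a seed entity, the model drafts a
    natural-language question, a grounding SPARQL query, and the answer
    set it believes is correct. Any LLM client could be plugged in here;
    the prompt design is not specified by the summary above."""
    raise NotImplementedError("plug in your LLM client here")

def build_verified_instance(seed_entity: str) -> dict | None:
    """LLM-in-the-loop generation with symbolic verification: keep a QA
    instance only if executing its SPARQL query against the knowledge
    graph reproduces the answer set the LLM proposed."""
    candidate = llm_propose_qa(seed_entity)
    gold = run_sparql(candidate["sparql"])
    if gold and gold == set(candidate["answers"]):
        return {
            "question": candidate["question"],
            "sparql": candidate["sparql"],
            "answers": sorted(gold),
        }
    return None  # reject instances the knowledge graph cannot verify
```

The key design point this sketch illustrates is that the knowledge graph, not the LLM, acts as the final arbiter: any candidate whose answers cannot be reproduced by query execution is discarded, which is what makes the resulting instances verifiable.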

Takeaways, Limitations

Takeaways:
Highlights the severity of quality issues in existing KGQA benchmarks and underscores the need for accurate, challenging evaluation sets.
KGQAGen, a new LLM-based benchmark-generation framework, has the potential to fix the problems of existing benchmarks and improve evaluation accuracy.
KGQAGen-10k exposes the performance limits of current KG-RAG models and points to directions for future model development.
Limitations:
Because the framework relies on an LLM, biases or errors in the LLM may propagate into the benchmark.
The specific implementation and algorithmic details of KGQAGen may be under-specified here (this summary is based only on limited information from the paper's abstract).
KGQAGen-10k is built on a single knowledge graph (Wikidata), so further work is needed to establish generalizability to other knowledge graphs.