
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Created by
  • Haebom

Authors

Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu

Outline

This paper presents HoH, a new benchmark for evaluating how outdated information in the knowledge base affects Retrieval-Augmented Generation (RAG) systems. Previous studies have focused on integrating up-to-date information, but the effect of outdated information coexisting with current information in the same knowledge base has received little attention. HoH uses a token-level difference algorithm together with an LLM pipeline to efficiently build a large-scale QA dataset that captures how real-world facts change over time. Experiments show that outdated information degrades RAG performance in two ways: (1) it lowers accuracy by distracting the model from the correct information, and (2) it can lead to potentially harmful outputs even when up-to-date information is available. These results highlight the need for new solutions to the temporal challenges in RAG. Code and data are available at https://github.com/0russwest0/HoH .
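To make the idea of a token-level difference concrete, here is a minimal sketch that compares two snapshots of the same sentence and returns only the token spans that changed. It is an illustration, not the authors' implementation: the whitespace tokenization, the use of Python's standard `difflib` module, the function name `token_diff`, and the example sentences (including the company and people named in them) are all assumptions made for this example.

```python
import difflib


def token_diff(old_text: str, new_text: str):
    """Return (operation, old_tokens, new_tokens) for each changed span.

    Hypothetical helper for illustration only. Tokens are whitespace-separated
    words, which is an assumption; the paper's tokenization may differ.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only the spans that differ between the two snapshots.
        if tag != "equal":
            changes.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return changes


# Made-up snapshots of one fact at two points in time.
old_snapshot = "The CEO of ExampleCorp is Alice Smith ."
new_snapshot = "The CEO of ExampleCorp is Bob Jones ."

print(token_diff(old_snapshot, new_snapshot))
# → [('replace', ['Alice', 'Smith'], ['Bob', 'Jones'])]
```

A changed span like this one marks a candidate fact whose answer differs over time. In a pipeline of the kind the summary describes, such spans could then be passed to an LLM to write a question with an outdated answer and a current answer, though that step is not shown here.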

Takeaways, Limitations

Takeaways:
HoH is presented as the first benchmark to systematically evaluate the negative impact of outdated information on RAG systems.
Experiments demonstrate that outdated information can reduce RAG accuracy and even cause the system to produce harmful output.
Both the retrieval and generation stages of RAG struggle to handle outdated information.
The work points to a new research direction: solving the temporal challenges of RAG.
Limitations:
The HoH benchmark may be limited to certain types of outdated information and RAG models.
It may not perfectly reflect the complex situations of the real world.
No solution is proposed: the benchmark raises the issue but does not address it.