Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Created by
  • Haebom

Authors

Hongjin Qian, Zheng Liu, Chao Gao, Yankai Wang, Defu Lian, Zhicheng Dou

Outline

HawkBench is a benchmark for evaluating how resiliently RAG systems adapt to the dynamic and diverse needs of users in real-world information-seeking scenarios. Existing benchmarks focus on specific task types (primarily factual questions) and rely on differing knowledge bases. HawkBench instead systematically stratifies tasks to cover a wide range of question types, including both factual and evidence-based questions; integrates multi-domain corpora across all task types to mitigate corpus bias; and provides rigorous annotations for high-quality evaluation. It includes 1,600 high-quality test samples, evenly distributed across domains and task types. The paper evaluates representative RAG methods in terms of answer quality and response latency, and the results highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalization.
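
To make the evaluation setup above more concrete, here is a minimal sketch of scoring a RAG method per (domain, task type) stratum on both answer quality and response latency. It is not the authors' evaluation code: the sample fields (`domain`, `task_type`, `question`, `answer`), the `evaluate` helper, and the use of token-level F1 as the quality metric are all assumptions made for illustration.

```python
import time
from collections import Counter, defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (assumed metric, not from the paper)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    rag: Callable[[str], str],
    samples: Iterable[Dict[str, str]],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Run `rag` on every sample and aggregate per (domain, task_type) stratum."""
    buckets = defaultdict(lambda: {"f1": [], "latency": []})
    for sample in samples:
        start = time.perf_counter()
        answer = rag(sample["question"])
        elapsed = time.perf_counter() - start  # response latency in seconds
        key = (sample["domain"], sample["task_type"])
        buckets[key]["f1"].append(token_f1(answer, sample["answer"]))
        buckets[key]["latency"].append(elapsed)
    return {
        key: {
            "mean_f1": mean(values["f1"]),
            "mean_latency_s": mean(values["latency"]),
            "n": len(values["f1"]),
        }
        for key, values in buckets.items()
    }
```

Passing any callable that maps a question string to an answer string as `rag` yields one row per stratum, which makes it easy to see whether a method that does well on factual questions degrades on other task types or pays for quality with latency.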

Takeaways, Limitations

Takeaways:
Presents HawkBench, a new benchmark for comprehensively evaluating the resilience of RAG systems.
Overcomes limitations of existing benchmarks by including diverse question types and multi-domain corpora.
Emphasizes the importance of dynamic task strategies for improving the generalization of RAG systems.
Provides a benchmark that can support further progress in RAG research.
Limitations:
It remains to be examined whether the benchmark's size (1,600 samples) is sufficient.
The evaluation covers representative RAG methods and may not be comprehensive across the full range of RAG models.
The benchmark may not fully reflect real user situations.