Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs

Created by
  • Haebom

Authors

Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu

Outline

This paper presents SciRerankBench, a benchmark for evaluating rerankers within two-stage retrieval-augmented generation (RAG-LLM) systems for scientific literature question answering. It highlights the critical role of rerankers in scientific fields, where subtle differences in terminology can significantly affect the accuracy of answers. SciRerankBench spans five scientific domains and provides three types of question-context-answer (QCA) pairs: Noisy Contexts, Semantically Similar but Logically Irrelevant Contexts, and Counterfactual Contexts, to rigorously evaluate rerankers in terms of noise robustness, relevance disambiguation, and factual consistency. Through a systematic evaluation of 13 rerankers across five LLM families, the authors provide insight into the strengths and limitations of each reranker, and note that SciRerankBench is the first benchmark dedicated to evaluating rerankers within RAG-LLM systems.
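
To make the two-stage setup concrete, here is a minimal, illustrative sketch of the retrieve-then-rerank flow that such a benchmark exercises; it is not code from the paper. The sentence-transformers models named below are generic placeholders (not among the 13 rerankers or five LLM families evaluated), and retrieve_then_rerank is a hypothetical helper.

```python
# Sketch of a two-stage retrieve-then-rerank pipeline (illustrative only;
# model names are placeholders, not the rerankers benchmarked in the paper).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")                  # stage 1: bi-encoder retriever
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # stage 2: cross-encoder reranker

def retrieve_then_rerank(question: str, corpus: list[str], top_k: int = 20, keep: int = 5) -> list[str]:
    """Return the `keep` contexts ranked highest by the reranker."""
    # Stage 1: cheap dense retrieval over the whole corpus.
    q_emb = retriever.encode(question, convert_to_tensor=True)
    c_emb = retriever.encode(corpus, convert_to_tensor=True)
    hits = util.cos_sim(q_emb, c_emb)[0].topk(min(top_k, len(corpus)))
    candidates = [corpus[i] for i in hits.indices.tolist()]

    # Stage 2: more expensive pairwise scoring of (question, context) pairs.
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```

The contexts returned by the reranker would then be passed to the generator LLM; SciRerankBench's noisy, semantically similar but logically irrelevant, and counterfactual QCA pairs probe whether distractor contexts are filtered out at this reranking stage.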

Takeaways, Limitations

Takeaways:
The paper highlights the importance of rerankers within RAG-LLM systems and provides SciRerankBench, the first benchmark specialized for this purpose.
A systematic evaluation of various rerankers and LLMs offers an in-depth understanding of the strengths and limitations of each reranker.
SciRerankBench provides valuable guidance for future reranker development.
It can contribute to improving the performance of scientific literature question answering.
Limitations:
The number of scientific fields, rerankers, and LLMs currently included in the benchmark may be limited.
A detailed explanation of how SciRerankBench generates QCA pairs may be lacking.
The benchmark needs to be expanded to cover more diverse types of questions and contexts.
Discussion of the limitations of the evaluation metrics, and of ways to improve them, may be lacking.