Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Language Model Re-rankers are Fooled by Lexical Similarities

Created by
  • Haebom

Authors

Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge

Outline

This paper evaluates the performance of Language Model (LM) re-rankers used to improve retrieval results in Retrieval-Augmented Generation (RAG). Six LM re-rankers are compared against a BM25 baseline on three datasets: NQ, LitQA2, and DRUID. The experiments show that the LM re-rankers fail to outperform the BM25 baseline on the DRUID dataset, and a novel separation metric based on BM25 scores is introduced to explain and identify re-ranker errors caused by a lack of lexical similarity between queries and relevant passages. The authors also investigate several methods for improving LM re-ranker performance, but find that these are effective mainly on the NQ dataset. In conclusion, the study exposes weaknesses of LM re-rankers and argues for evaluation on more adversarial and realistic datasets.
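The separation metric is defined precisely in the paper; as a rough illustration of the idea only, the sketch below (assuming the rank_bm25 and sentence-transformers packages, a toy query, and an off-the-shelf cross-encoder as a stand-in for the paper's six re-rankers) computes the BM25 score gap between a gold passage and a lexically similar distractor:

```python
# Hypothetical sketch: BM25 score gap between a gold passage and a
# lexically overlapping distractor, plus an LM re-ranker's preference.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy query; the distractor shares surface words with the query
# ("sun's energy output") while being irrelevant to the question.
query = "what fuels the sun's energy output"
gold = "The sun produces energy through nuclear fusion of hydrogen into helium."
distractor = "Solar panels capture the sun's energy output and convert it to electricity."

# BM25 scores the query against each passage by lexical overlap.
tokenized = [p.lower().split() for p in (gold, distractor)]
bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores(query.lower().split())

# One way to read "separation": the BM25 score gap between gold and
# distractor. A small or negative gap flags pairs that lexical cues
# alone cannot distinguish, which is where the paper finds re-rankers err.
separation = scores[0] - scores[1]
print(f"BM25 separation (gold - distractor): {separation:.2f}")

# An off-the-shelf cross-encoder stands in for an LM re-ranker here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
lm_scores = reranker.predict([(query, gold), (query, distractor)])
print(f"Re-ranker prefers gold passage: {bool(lm_scores[0] > lm_scores[1])}")
```

Bucketing evaluation examples by this gap would show where a re-ranker's accuracy drops relative to BM25, which is the diagnostic use the paper describes.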

Takeaways, Limitations

Takeaways:
LM re-rankers do not always outperform simple lexical baselines like BM25, and the performance gap varies significantly across datasets.
A new separation metric based on BM25 scores makes it possible to explain and diagnose LM re-ranker errors, attributing them to a lack of lexical similarity.
Further research and development are needed to improve the performance of LM re-rankers.
Limitations:
The proposed improvement methods were effective mainly on a single dataset (NQ) and are difficult to generalize into a broad performance-improvement strategy.
Evaluation on more adversarial and realistic datasets is needed.
Further analysis is needed to determine why the LM re-rankers fail to outperform the BM25 baseline on the DRUID dataset.