Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

Created by
  • Haebom

Author

Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu

Hallucination Detection Metrics: A Large-Scale Empirical Evaluation

Outline

This paper examines how hallucinations in language models are measured. Noting that current hallucination detection metrics have limited reliability and generalizability, the authors conduct a large-scale evaluation of six hallucination detection metrics across four datasets, 37 language models, and five decoding methods. The results show that existing metrics often fail to match human judgment, take a myopic view of the hallucination problem, and do not improve consistently as model size increases. On the positive side, LLM-based evaluation, such as with GPT-4, yields the best results, and mode-seeking decoding proves effective at reducing hallucinations.
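
The strong performance of LLM-based evaluation refers to prompting a capable model such as GPT-4 to judge whether a generated answer is supported by its source. Below is a minimal sketch of such a judge setup using the OpenAI Python client; the prompt wording and the `judge_hallucination` helper are illustrative assumptions, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are checking a model answer for hallucinations.
Source document:
{source}

Question: {question}
Model answer: {answer}

Reply with exactly one word: "FAITHFUL" if every claim in the answer is
supported by the source, or "HALLUCINATED" otherwise."""


def judge_hallucination(source: str, question: str, answer: str) -> bool:
    """Return True if the judge model flags the answer as hallucinated."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                source=source, question=question, answer=answer),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("HALLUCINATED")


if __name__ == "__main__":
    flagged = judge_hallucination(
        source="The Eiffel Tower was completed in 1889.",
        question="When was the Eiffel Tower completed?",
        answer="It was completed in 1901.",
    )
    print("hallucinated" if flagged else "faithful")
```

A binary verdict keeps the judge easy to score against human labels; finer-grained rubrics (span-level or graded support) are also possible but harder to aggregate.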

Takeaways, Limitations

Current hallucination detection metrics often do not match human judgment.
Existing metrics take a myopic view of the hallucination problem.
Detection performance does not consistently improve as model size increases.
LLM-based evaluation (especially with GPT-4) performs best.
Mode-seeking decoding is effective at reducing hallucinations (see the decoding sketch after this list).
The paper emphasizes the need for more robust hallucination detection metrics and hallucination mitigation strategies.
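
"Mode-seeking decoding" refers to strategies such as greedy or beam search that concentrate on the highest-probability continuations, in contrast to temperature sampling. The sketch below contrasts the two with Hugging Face transformers; the model name and prompt are placeholders, not the paper's experimental setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates 37 different models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of Australia is"
inputs = tokenizer(prompt, return_tensors="pt")

# Mode-seeking decoding: beam search follows high-probability continuations,
# which the summary reports is effective at reducing hallucinations.
beam_ids = model.generate(
    **inputs, do_sample=False, num_beams=4, max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)

# Sampling-based decoding: temperature sampling explores lower-probability
# tokens and is more prone to unsupported content.
sample_ids = model.generate(
    **inputs, do_sample=True, temperature=1.0, top_p=0.95, max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)

print("beam:  ", tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print("sample:", tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```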