Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

Created by
  • Haebom

Authors

Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry

Outline

This paper addresses hallucinations in large language model (LLM) responses to healthcare queries from patients. Unlike prior work that assesses LLMs' medical knowledge through standardized medical exam questions, this study analyzes hallucinations in LLM responses to real patients' healthcare questions. To this end, we present MedHalu, a new benchmark covering diverse medical topics with LLM-generated hallucinated responses, annotated in detail with hallucination types and hallucinated text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating how well LLMs detect hallucinations, and study the vulnerability of three evaluator groups to medical hallucinations: medical experts, LLMs, and laypeople. The results show that LLMs detect hallucinations significantly worse than medical experts and, in some cases, even laypeople. We therefore propose an expert-in-the-loop approach that injects expert reasoning into the LLM's input, improving LLMs' hallucination detection performance (e.g., a 6.3% improvement in macro-F1 for GPT-4).
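Below is a minimal sketch of what such an expert-in-the-loop detection setup might look like; it is not the authors' MedHaluDetect implementation. The function `call_llm`, the prompt wording, and the sample record fields are illustrative assumptions standing in for any chat-completion API and data schema.

```python
# Illustrative sketch of expert-in-the-loop hallucination detection.
# Not the authors' code: `call_llm` is a hypothetical stand-in for any
# chat-completion API, and the prompt text is an assumption.
from sklearn.metrics import f1_score


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError


def detect_hallucination(question: str, answer: str,
                         expert_reasoning: str | None = None) -> int:
    """Ask a judge LLM whether `answer` hallucinates; return 1 (yes) or 0 (no)."""
    prompt = (
        "You are evaluating a medical answer for hallucinations.\n"
        f"Patient question: {question}\n"
        f"Answer to evaluate: {answer}\n"
    )
    if expert_reasoning:
        # Expert-in-the-loop: include expert rationale in the judge's input.
        prompt += f"Expert reasoning to consider: {expert_reasoning}\n"
    prompt += "Does the answer contain a hallucination? Reply 'yes' or 'no'."
    return 1 if call_llm(prompt).strip().lower().startswith("yes") else 0


def evaluate(samples: list[dict]) -> float:
    """Macro-F1 of the judge over labeled (question, answer, label) samples."""
    preds = [detect_hallucination(s["question"], s["answer"],
                                  s.get("expert_reasoning"))
             for s in samples]
    gold = [s["label"] for s in samples]
    return f1_score(gold, preds, average="macro")
```

Comparing `evaluate` on the same samples with and without the `expert_reasoning` field populated mirrors the paper's reported gain from adding expert input, though the exact prompting and aggregation details here are guesses.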

Takeaways, Limitations

Takeaways:
Presents MedHalu, a medical hallucination benchmark built from real patient questions, and MedHaluDetect, an evaluation framework for hallucination detection.
Empirically demonstrates that LLMs detect medical hallucinations significantly worse than medical experts and, in some cases, laypeople.
Shows that expert-in-the-loop input can improve LLMs' hallucination detection performance.
Provides important takeaways for ensuring the safety and reliability of LLM-based medical information systems.
Limitations:
The MedHalu benchmark is limited in data size and diversity.
Only a limited set of LLMs is evaluated in the study.
Further research is needed on the generalizability of the expert-in-the-loop approach and its applicability in real-world clinical settings.
A more detailed analysis of the different hallucination types and their severity is needed.