Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

Created by
  • Haebom

Authors

Juraj Vladika, Mahdi Dhaini, Florian Matthes

Outline

This paper examines the potential of large language models (LLMs) to improve healthcare by supporting medical research and physicians. However, their reliance on static training data poses a significant risk when medical recommendations evolve in response to new research and developments: LLMs may give harmful advice or fail at clinical reasoning tasks if they retain outdated medical knowledge. To investigate this issue, the authors present two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where the medical consensus has changed over time). Evaluations of eight leading LLMs on these datasets reveal a consistent reliance on outdated knowledge across all models. The paper further analyzes the role of obsolete pretraining data and training strategies in explaining this behavior, proposes future directions for mitigation, and thereby lays the groundwork for more up-to-date and reliable medical AI systems.
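As a rough illustration of the evaluation setup described above, the following Python sketch scores a model's yes/no answers against the current medical consensus for QA pairs like those in MedRevQA and MedChangeQA. The field names, the ask_model stub, and the sample question are assumptions for illustration only, not the authors' actual pipeline or data.

```python
# Minimal sketch of consensus-based QA evaluation (hypothetical; not the
# paper's actual pipeline). Field names, ask_model(), and the sample item
# are illustrative assumptions.

def ask_model(question: str) -> str:
    """Stand-in for a call to the LLM under evaluation.

    Here it always answers "yes", simulating a model that memorized an
    older recommendation; replace this with a real model/API call.
    """
    return "yes"

def evaluate(qa_pairs: list[dict]) -> float:
    """Return the fraction of answers that match the *current* consensus."""
    correct = 0
    for pair in qa_pairs:
        answer = ask_model(pair["question"]).strip().lower()
        # Only agreement with the up-to-date consensus counts as correct;
        # matching a superseded recommendation indicates outdated knowledge.
        if answer == pair["current_consensus"].lower():
            correct += 1
    return correct / len(qa_pairs)

if __name__ == "__main__":
    # Illustrative example of a reversed recommendation (not taken from the datasets).
    sample = [
        {
            "question": (
                "Is hormone replacement therapy recommended for primary "
                "prevention of cardiovascular disease? (yes/no)"
            ),
            "current_consensus": "no",
        }
    ]
    print(f"Agreement with current consensus: {evaluate(sample):.2f}")
```

With the stub above, the script prints 0.00, reflecting a model whose answer matches the older, superseded recommendation rather than the current consensus.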

Takeaways, Limitations

Takeaways:
Clearly highlights the problem of relying on outdated medical knowledge when applying LLMs in the medical field.
Introduces two new QA datasets (MedRevQA, MedChangeQA) for evaluating the outdated-knowledge problem.
Experimentally demonstrates a consistent reliance on outdated knowledge across a range of LLMs.
Analyzes the causes of the outdated-knowledge problem and suggests directions for mitigation.
Lays the foundation for developing more up-to-date and reliable medical AI systems.
Limitations:
The presented datasets, especially MedChangeQA, may need to be expanded to support further research.
Only a limited set of LLMs (eight) is evaluated.
The effectiveness of the proposed mitigation directions still needs to be empirically verified.