Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Disentangling Reasoning and Knowledge in Medical Large Language Models

Created by
  • Haebom

Authors

Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

Outline

This paper presents a methodology for splitting existing medical QA benchmarks into reasoning-focused and knowledge-focused subsets in order to evaluate the medical reasoning ability of large language models (LLMs). Using a PubMedBERT-based classifier, the authors analyze 11 biomedical QA benchmarks and find that only 32.8% of questions require complex reasoning. They evaluate several biomedical and general-domain models, identify performance differences between knowledge-focused and reasoning-focused questions, and show that biomedical models are particularly vulnerable to errors in their initial reasoning steps. To address these limitations, they train the BioMed-R1 model with fine-tuning and reinforcement learning on reasoning-focused data, achieving the best performance among similarly sized models. They suggest that further gains could come from incorporating clinical case reports and training on adversarial and backtracking scenarios.
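To make the benchmark-splitting step concrete, here is a minimal sketch of how a PubMedBERT-based sequence classifier could label individual questions as reasoning- or knowledge-focused. This is not the authors' released code: the checkpoint path `path/to/reasoning-knowledge-classifier` and the label order are assumptions for illustration only.

```python
# Minimal sketch: labeling benchmark questions with a fine-tuned PubMedBERT classifier.
# Assumes a hypothetical fine-tuned checkpoint; not the authors' released artifact.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/to/reasoning-knowledge-classifier"  # hypothetical fine-tuned PubMedBERT
LABELS = ["knowledge-focused", "reasoning-focused"]    # assumed label order

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

def classify_question(question: str) -> str:
    """Predict whether a benchmark question is knowledge- or reasoning-focused."""
    inputs = tokenizer(question, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Example usage on a MedQA-style question stem.
print(classify_question(
    "A 54-year-old man presents with chest pain radiating to the left arm. "
    "Which of the following is the most appropriate next step in management?"
))
```

Applied across all 11 benchmarks, a classifier like this yields the reasoning-focused subset (reported as 32.8% of questions) on which the models are then compared.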

Takeaways, Limitations

Takeaways:
A new benchmarking approach for assessing medical reasoning ability (splitting existing benchmarks into reasoning-focused and knowledge-focused subsets)
Clarifies the performance gap between knowledge recall and reasoning ability in biomedical LLMs
Reveals biomedical models' vulnerability to erroneous initial reasoning and presents the BioMed-R1 model to address it
Points to future directions: incorporating clinical case reports and training on adversarial/backtracking scenarios
Limitations:
The PubMedBERT classifier's accuracy (81%) is not perfect, so classification errors are possible.
The range and scale of models evaluated may be limited.
Despite its improved performance, BioMed-R1 may still fall short of robust medical reasoning.
Further research is needed on clinical applicability.