This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation
Created by
Haebom
Authors
Radhika Dua, Young Joon (Fred) Kwon, Siddhant Dogra, Daniel Freedman, Diana Ruan, Motaz Nashawaty, Danielle Rigau, Daniel Alexander Alber, Kang Zhang, Kyunghyun Cho, Eric Karl Oermann
Outline
This paper proposes ICARE, an interpretable, clinically grounded evaluation framework for radiology report generation (RRG), motivated by the need for reliable clinical evaluation before such systems can be deployed safely. ICARE uses large language model (LLM) agents and dynamic multiple-choice question answering (MCQA): two agents, one given the ground-truth report and one given the generated report, generate clinically meaningful questions and quiz each other. The rate at which their answers agree serves as an interpretable proxy for clinical precision and recall, and because every score traces back to specific question-answer pairs, the assessment is transparent. Clinician studies show that ICARE aligns significantly more closely with expert judgment than existing metrics. Perturbation analyses confirm that it is sensitive to clinical content and reproducible, and model comparisons reveal interpretable error patterns.
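The agreement-based scoring described above can be illustrated with a small Python sketch. It is not the authors' implementation: the `MCQ` class, the `agreement` and `agreement_scores` functions, and the lookup-table "agents" are all hypothetical stand-ins, and the mapping of each direction to recall or precision is one assumed reading of the summary. In a real system each agent would be an LLM that sees only one of the two reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question plus the option its authoring agent chose."""
    question: str
    options: Tuple[str, ...]
    key: int  # index of the option the authoring agent considers correct


# Hypothetical agent type: any callable that returns an option index.
# A real agent would wrap an LLM that reads exactly one report.
Agent = Callable[[MCQ], int]


def agreement(questions: Sequence[MCQ], answerer: Agent) -> float:
    """Fraction of questions on which the answerer matches the author's key."""
    if not questions:
        return 0.0
    hits = sum(1 for q in questions if answerer(q) == q.key)
    return hits / len(questions)


def agreement_scores(
    gt_questions: Sequence[MCQ],
    gen_questions: Sequence[MCQ],
    gt_agent: Agent,
    gen_agent: Agent,
) -> Dict[str, float]:
    """Assumed mapping: questions written from the ground-truth report and
    answered from the generated report approximate recall; the reverse
    direction approximates precision."""
    return {
        "recall_proxy": agreement(gt_questions, gen_agent),
        "precision_proxy": agreement(gen_questions, gt_agent),
    }


def lookup_agent(answers: Dict[str, int]) -> Agent:
    """Toy agent that answers from a fixed table instead of calling an LLM."""
    return lambda q: answers[q.question]


# Toy data, invented for illustration only.
gt_questions = [
    MCQ("Is there a pleural effusion?", ("Yes", "No"), 0),
    MCQ("Is there a pneumothorax?", ("Yes", "No"), 1),
]
gen_questions = [MCQ("Is there cardiomegaly?", ("Yes", "No"), 0)]

gen_agent = lookup_agent({
    "Is there a pleural effusion?": 1,  # generated report missed the effusion
    "Is there a pneumothorax?": 1,
})
gt_agent = lookup_agent({"Is there cardiomegaly?": 1})  # finding not in ground truth

scores = agreement_scores(gt_questions, gen_questions, gt_agent, gen_agent)
print(scores)
```

Because each score is just a tally over individual questions, any disagreement can be traced to the specific question-answer pair that caused it, which is the interpretability property the summary highlights.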
Takeaways, Limitations
• Takeaways:
  ◦ ICARE is an interpretable evaluation framework for radiology report generation (RRG).
  ◦ It aligns more closely with expert judgment than conventional black-box metrics.
  ◦ It provides transparent, interpretable assessments by linking each score to the underlying question-answer pairs.
  ◦ Model comparisons make it possible to analyze error patterns in an interpretable way.
• Limitations:
  ◦ Further research is needed on how well ICARE generalizes.
  ◦ Its applicability to different types of radiological imaging and clinical environments still needs to be verified.
  ◦ Its reliance on large language model agents may introduce limitations of its own.