
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Created by
  • Haebom

Authors

Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim

Outline

This paper addresses the limitations of automated evaluation metrics for Retrieval-Augmented Generation (RAG), an approach that has emerged to improve factual accuracy in clinical question answering (QA) systems. Because existing automated metrics underperform in clinical and conversational use cases, and manual evaluation is expensive and does not scale, the authors propose ASTRID, an automated and scalable evaluation triad. ASTRID consists of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). CF, in particular, is designed to assess how faithfully a response adheres to the knowledge base without penalizing conversational elements. The authors validate ASTRID on a dataset of post-cataract-surgery patient questions and show that CF predicts human judgments in conversational use cases better than existing metrics. They further show that ASTRID agrees with clinician assessments of inappropriate, harmful, and unhelpful responses, and that all three metrics align well with human evaluations across a range of LLMs. Finally, the prompts and datasets used in the experiments are released publicly to support further research and development.
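To make the triad concrete, below is a minimal Python sketch of how the three scores might be computed per conversational turn with an LLM-as-judge. This is not the authors' implementation: the `Turn` structure, the judge prompts, and the `toy_judge` stub are illustrative assumptions; the actual prompts are the ones released with the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "judge" is any callable answering a yes/no prompt with True/False.
# In practice this would wrap an LLM call; here it is a hypothetical stub.
Judge = Callable[[str], bool]

@dataclass
class Turn:
    question: str
    context: List[str]               # retrieved knowledge-base snippets
    response_statements: List[str]   # the response split into atomic statements
    should_refuse: bool              # gold label: should the system refuse?
    refused: bool                    # did the system actually refuse?

def context_relevance(turn: Turn, judge: Judge) -> float:
    """CR: fraction of retrieved snippets judged relevant to the question."""
    if not turn.context:
        return 0.0
    hits = sum(
        judge(f"Is this snippet relevant to: {turn.question!r}?\n{s}")
        for s in turn.context
    )
    return hits / len(turn.context)

def refusal_accuracy(turn: Turn) -> float:
    """RA: 1.0 if the refusal decision matches the gold label, else 0.0."""
    return float(turn.refused == turn.should_refuse)

def conversational_faithfulness(turn: Turn, judge: Judge) -> float:
    """CF: of the substantive (non-conversational) statements, the fraction
    supported by the retrieved context. Conversational filler such as
    'Thanks for asking!' is excluded rather than penalized."""
    substantive = [
        s for s in turn.response_statements
        if not judge(f"Is this statement purely conversational filler?\n{s}")
    ]
    if not substantive:
        return 1.0  # nothing substantive, so nothing can contradict the context
    ctx = " ".join(turn.context)
    supported = sum(
        judge(f"Context: {ctx}\nIs this statement supported by the context?\n{s}")
        for s in substantive
    )
    return supported / len(substantive)

def toy_judge(prompt: str) -> bool:
    # Placeholder so the sketch runs end to end: a real judge would send the
    # prompt to an LLM and parse its yes/no answer. This stub answers "no" to
    # filler checks and "yes" to everything else, i.e. it treats every
    # statement as substantive and supported.
    return "conversational filler" not in prompt

turn = Turn(
    question="When can I drive after cataract surgery?",
    context=["Most patients may resume driving once their vision meets legal standards."],
    response_statements=[
        "Thanks for asking!",
        "Many patients can drive again within days, once their vision meets legal standards.",
    ],
    should_refuse=False,
    refused=False,
)
print(context_relevance(turn, toy_judge))            # 1.0
print(refusal_accuracy(turn))                        # 1.0
print(conversational_faithfulness(turn, toy_judge))  # 1.0
```

The design point mirrored here is CF's filtering step: conversational statements are removed before faithfulness is scored, so politeness and small talk do not count against the model.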

Takeaways, Limitations

Takeaways:
• Introduces ASTRID (CR, RA, CF), a new set of metrics for the automated evaluation of clinical QA systems.
• Proposes and validates CF, a new metric for evaluating response faithfulness in conversational contexts (see the sketch above).
• ASTRID correlates strongly with human assessments, suggesting it can serve in automated evaluation pipelines.
• Releases the experimental datasets and prompts, supporting follow-up research.
Limitations:
• The generalizability of ASTRID needs further study, including validation across other medical fields and conditions.
• The definition and computation of the CF metric need further explanation and refinement.
• Additional validation on larger and more diverse datasets is needed.
• Because the dataset focuses on a single clinical domain (cataract surgery), generalized performance claims are limited.