Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

Created by
  • Haebom

Authors

Gerardo A. Flores, Alyssa H. Smith, Julia A. Fukuyama, Ashia C. Wilson

Outline

This paper addresses shortcomings in how machine-learning-based decision support systems, increasingly used in medical practice, are evaluated, and proposes a new evaluation framework to resolve them. Common metrics such as accuracy and AUC-ROC do not adequately reflect clinical priorities such as calibration, robustness to distribution shift, and sensitivity to asymmetric error costs. The paper therefore presents a principled yet practical framework for evaluating and selecting calibrated threshold classifiers that explicitly accounts for uncertainty in class prevalence and for the domain-specific asymmetric costs common in clinical settings. In particular, drawing on proper scoring rule theory centered on the Schervish representation, the authors derive a variant of cross-entropy (the log score) that averages cost-weighted performance over a clinically relevant range of class balances. The resulting evaluation scheme is designed to favor models that are simple to deploy, sensitive to clinical deployment conditions, and robust to calibration error and real-world distribution shift.
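To make the idea concrete: the Schervish representation writes a proper scoring rule as a weighted integral, over cost thresholds, of cost-weighted misclassification losses; the log score corresponds to the weight 1/(c(1-c)). A minimal sketch of restricting that integral to a clinically relevant threshold range follows. This is an illustration of the general construction, not the authors' exact formulation; the function name, default range `[0.05, 0.5]`, and grid size are assumptions.

```python
import numpy as np

def cost_weighted_log_score(y_true, p_pred, c_lo=0.05, c_hi=0.5, n_grid=200):
    """Illustrative cost-weighted log-score variant (hypothetical form).

    Via the Schervish representation, the log score is an integral over
    cost thresholds c of the cost-weighted 0-1 loss (cost c for a false
    positive, 1-c for a false negative) with weight 1/(c(1-c)).
    Restricting the integral to a clinically relevant range [c_lo, c_hi]
    emphasizes the operating points that matter for deployment.
    """
    c = np.linspace(c_lo, c_hi, n_grid)   # grid of cost thresholds
    w = 1.0 / (c * (1.0 - c))             # Schervish weight of the log score
    y = np.asarray(y_true).astype(bool)[:, None]
    yhat = np.asarray(p_pred)[:, None] >= c[None, :]   # threshold decisions
    fp = (~y) & yhat
    fn = y & (~yhat)
    loss = c[None, :] * fp + (1.0 - c[None, :]) * fn   # cost-weighted 0-1 loss
    g = w * loss.mean(axis=0)             # average over samples, weight by w
    # trapezoidal integration over the threshold range
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(c)))
```

A perfectly classified sample contributes zero at every threshold, so the score is zero; integrating over the full range (0, 1) would recover the usual log score.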

Takeaways and Limitations

Takeaways:
Overcomes the limitations of existing metrics for evaluating machine learning models in medicine and presents a new evaluation framework that reflects clinical priorities.
Enables more realistic model evaluation by accounting for class imbalance and asymmetric error costs.
Supports simple and effective evaluation via the proposed cost-weighted cross-entropy variant.
Helps predict model performance in real clinical environments and select robust models.
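Two standard decision-theoretic facts underlie these takeaways: a calibrated classifier's Bayes-optimal threshold is determined by the error-cost ratio, and probabilities can be prior-corrected when the deployment prevalence differs from the training prevalence (label shift). A minimal sketch, not taken from the paper:

```python
def bayes_threshold(cost_fp, cost_fn):
    """Bayes-optimal threshold for a calibrated classifier:
    predict positive when p >= cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

def adjust_for_prevalence(p, pi_train, pi_deploy):
    """Prior correction under label shift: rescale a calibrated
    probability p, estimated at training prevalence pi_train, to the
    deployment prevalence pi_deploy (reweight class-conditional odds)."""
    num = p * pi_deploy / pi_train
    den = num + (1.0 - p) * (1.0 - pi_deploy) / (1.0 - pi_train)
    return num / den
```

For example, if a missed diagnosis (false negative) costs nine times a false alarm, `bayes_threshold(1, 9)` gives an operating threshold of 0.1, and `adjust_for_prevalence` leaves probabilities unchanged when the two prevalences coincide.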
Limitations:
The proposed framework requires further validation in practical clinical applications.
Further research is needed on generalizability across different clinical settings and disease types.
Setting the cost function is subjective and depends on domain knowledge.
Additional guidance is needed on how to interpret and understand the new evaluation metric.