This paper addresses shortcomings in the evaluation of machine learning-based decision support systems, which are increasingly used in medical practice, and proposes a novel evaluation framework to remedy them. Standard metrics such as accuracy and AUC-ROC do not adequately reflect important clinical priorities such as calibration, robustness to distributional shift, and sensitivity to asymmetric error costs. We therefore present a principled and practical evaluation framework for selecting calibrated threshold classifiers that explicitly accounts for uncertainty in class prevalence and for the domain-specific asymmetric costs frequently encountered in clinical settings. In particular, drawing on proper scoring rule theory centered on the Schervish representation, we derive a calibrated variant of cross-entropy (the log score) that averages cost-weighted performance over a clinically relevant range of class balances. The proposed evaluation scheme is designed to prioritize models that are easy to apply, sensitive to clinical deployment conditions, and robust to miscalibration and real-world distributional change.
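For context, a minimal sketch of the Schervish-style representation referred to above; the notation and the restriction to an interval of cost values are illustrative assumptions, not necessarily the paper's exact construction. Any proper scoring rule for a binary outcome can be written as a mixture of cost-weighted misclassification losses over decision thresholds:
\[
  S(q, y) \;=\; \int_0^1 \ell_c\!\big(\mathbf{1}\{q > c\},\, y\big)\,\omega(c)\,\mathrm{d}c,
  \qquad
  \ell_c(\hat{y}, y) \;=\; c\,\mathbf{1}\{\hat{y}=1,\, y=0\} \;+\; (1-c)\,\mathbf{1}\{\hat{y}=0,\, y=1\},
\]
where $q \in (0,1)$ is the predicted probability, $y \in \{0,1\}$ the observed label, $\ell_c$ the cost-weighted 0--1 loss at threshold $c$, and $\omega$ a score-specific weight; for the log score, $\omega(c) = 1/\big(c(1-c)\big)$. Averaging only over a clinically relevant interval $[c_{\min}, c_{\max}]$ of cost/prevalence values,
\[
  S_{[c_{\min}, c_{\max}]}(q, y) \;=\; \int_{c_{\min}}^{c_{\max}} \ell_c\!\big(\mathbf{1}\{q > c\},\, y\big)\,\omega(c)\,\mathrm{d}c,
\]
yields a cost-weighted evaluation criterion of the kind the abstract describes.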