This paper asks whether the promising performance metrics reported for state-of-the-art drift-adaptive malware classifiers translate into real-world operational reliability. Existing evaluation approaches focus on baseline performance metrics alone, overlooking confidence-error alignment and operational reliability. While TESSERACT established the importance of temporal evaluation, this paper takes a complementary approach: it investigates whether malware classifiers maintain reliable and stable confidence estimates under distributional change, and explores the tension between scientific progress and practical impact when they do not. We therefore propose AURORA, a framework for evaluating malware classifiers on the basis of confidence quality and operational resilience. AURORA assesses the reliability of a model's probability estimates by examining its confidence profile. Unreliable confidence estimates undermine operational reliability: they waste valuable annotation budget on uninformative samples in active learning and fail to flag error-prone instances for rejection in selective classification. AURORA is complemented by a set of metrics that go beyond single-point performance to provide a more comprehensive assessment of operational reliability across the temporal evaluation period. The vulnerability of state-of-the-art frameworks across datasets with varying levels of drift suggests the need to start from scratch.
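
To make the notion of confidence-error alignment concrete, the sketch below shows two standard diagnostics of the kind such an evaluation could build on: expected calibration error (how far stated confidence deviates from empirical accuracy) and selective-classification risk at a confidence threshold (whether errors concentrate in the low-confidence region the model would abstain on). This is a minimal illustration under assumed inputs, not AURORA's actual implementation; the function names, bin count, and threshold are our own choices.

```python
# Illustrative sketch (not the AURORA implementation): two simple
# confidence-quality diagnostics over model confidences and correctness labels.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece


def selective_risk(confidences, correct, threshold):
    """Coverage and error rate on the samples kept when abstaining below `threshold`."""
    accepted = confidences >= threshold
    if not accepted.any():
        return 0.0, 0.0
    coverage = accepted.mean()
    risk = 1.0 - correct[accepted].mean()
    return coverage, risk


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=1000)      # hypothetical confidence scores
    correct = (rng.random(1000) < conf).astype(float)  # toy model whose confidence tracks accuracy
    print("ECE:", expected_calibration_error(conf, correct))
    print("coverage, selective risk @0.9:", selective_risk(conf, correct, 0.9))
```

A well-aligned model yields a low ECE and a selective risk that drops as the threshold rises; under drift, a brittle model may keep high confidence while both quantities degrade, which is precisely the failure mode the evaluation above targets.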