Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift?

Created by
  • Haebom

Author

Alexander Herzog, Aliai Eusebi, Lorenzo Cavallaro

Outline

This paper asks whether the promising performance metrics of state-of-the-art drift-adaptive malware classifiers translate into real-world operational reliability. Existing evaluation approaches focus on baseline performance metrics alone, overlooking confidence-error alignment and operational reliability. While TESSERACT established the importance of temporal evaluation, this work takes a complementary approach: it investigates whether malware classifiers maintain reliable and stable confidence estimates under distribution shift, and examines the tension between scientific progress and practical impact when they do not. The authors therefore propose AURORA, a framework for evaluating malware classifiers based on confidence quality and operational resilience. AURORA assesses the reliability of a model's predictions by examining its confidence profile: unreliable confidence estimates can undermine operational reliability, waste valuable annotation budget on uninformative samples in active learning, and miss error-prone instances in selective classification. AURORA is complemented by a set of metrics that go beyond single-point performance, giving a more comprehensive assessment of operational reliability over the temporal evaluation period. The vulnerability of state-of-the-art frameworks across datasets with varying degrees of drift suggests the need to start from scratch.
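To make confidence-error alignment and selective classification more concrete, below is a minimal sketch of two checks a practitioner might run on a temporally split test stream: expected calibration error (how well confidence tracks accuracy) and selective-classification risk at a confidence threshold (error rate on the samples the model does not abstain on). This is an illustrative assumption, not AURORA's actual metric suite; the function names and the commented-out `monthly_batches` iterator are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence to
    empirical accuracy in each bin (standard ECE formulation)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def selective_risk_coverage(confidences, correct, threshold):
    """Abstain on samples below the confidence threshold; report the
    error rate (risk) on the accepted portion and the coverage."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    accepted = confidences >= threshold
    coverage = accepted.mean()
    risk = 1.0 - correct[accepted].mean() if accepted.any() else 0.0
    return risk, coverage

# Hypothetical monthly evaluation over a temporally split test stream,
# where monthly_batches yields (confidences, correct) arrays per month:
# for month, (conf, corr) in enumerate(monthly_batches):
#     print(month, expected_calibration_error(conf, corr),
#           selective_risk_coverage(conf, corr, threshold=0.9))
```

Tracking such quantities month by month, rather than as a single aggregate score, is what distinguishes an operational-reliability view from the single-point performance numbers the paper critiques.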

Takeaways, Limitations

Takeaways:
Presents AURORA, a new framework for evaluating the real-world reliability of malware classifiers.
Proposes a comprehensive evaluation approach that considers confidence quality and operational resilience.
Emphasizes, beyond temporal evaluation, the importance of stable confidence estimates under distribution shift.
Points out the limitations of existing evaluation methods and argues for more realistic evaluation criteria.
Limitations:
Additional experiments and validation are needed to establish the practical applicability and performance of the AURORA framework.
Generalizability across different malware types and drift scenarios needs to be examined.
Clear guidelines for interpreting and using the proposed metrics may be lacking.
The paper notes the need to start from scratch but offers little concrete direction.