Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation

Created by
  • Haebom

Authors

Gregor Baer, Isel Grau, Chao Zhang, Pieter Van Gorp

Outline

This paper addresses the challenge of evaluating feature attribution methods in explainable AI (XAI). In the absence of ground truth, researchers typically rely on perturbation-based metrics, but recent studies have shown that such metrics can yield different results across predicted classes within the same dataset. This “class-dependent evaluation effect” raises the question of whether perturbation analysis reliably measures attribution quality, with direct implications for how XAI methods are developed and validated. The paper investigates the conditions under which these class-dependent effects occur through controlled experiments on synthetic time-series data with known ground-truth feature locations. Systematically varying feature types and class contrasts in a binary classification task, it compares perturbation-based degradation scores with ground-truth precision-recall metrics across multiple attribution methods. The results show that class-dependent effects arise in both evaluation approaches even in simple scenarios with temporally localized features, driven by basic differences between classes in feature amplitude or temporal extent. Most importantly, perturbation-based metrics and ground-truth metrics often produce conflicting estimates of attribution quality across classes, and the correlation between the two evaluation approaches is weak. These results suggest that researchers should interpret perturbation-based metrics cautiously, as they may not always indicate whether an attribution correctly identifies the distinguishing features. By demonstrating this discrepancy, the study highlights the need to reconsider what attribution evaluations actually measure and to develop more rigorous evaluation methods that capture multiple dimensions of attribution quality.
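To make the contrast between the two evaluation approaches concrete, here is a minimal sketch (not the authors' code) of how a perturbation-based degradation score and a ground-truth precision/recall check can be computed on a synthetic time series with a known informative region. The stand-in "model" and the proxy attribution are assumptions for illustration only; a real experiment would use a trained time-series classifier and an actual attribution method.

```python
# Minimal sketch: perturbation-based degradation vs. ground-truth precision/recall
# on a synthetic time series whose informative region is known.
import numpy as np

rng = np.random.default_rng(0)

T = 100                              # length of the series
feature_idx = np.arange(40, 60)      # known (ground-truth) informative time steps

# Synthetic series: noise plus a localized bump on the informative region.
x = rng.normal(0.0, 0.3, T)
x[feature_idx] += 1.0

def model_score(series):
    """Stand-in classifier score: mean activation over the informative window.
    (Assumption for illustration; a trained classifier would be used in practice.)"""
    return series[feature_idx].mean()

# Hypothetical attribution: absolute signal value as a crude proxy.
attribution = np.abs(x)

# --- Perturbation-based degradation score ---------------------------------
# Occlude the top-k most attributed time steps and measure the drop in the
# model score; a larger drop is usually read as "better" attribution.
k = 20
top_k = np.argsort(attribution)[-k:]
x_perturbed = x.copy()
x_perturbed[top_k] = 0.0             # simple occlusion-by-zero perturbation
degradation = model_score(x) - model_score(x_perturbed)

# --- Ground-truth precision/recall -----------------------------------------
# How many of the top-k attributed steps fall inside the known feature region?
hits = np.intersect1d(top_k, feature_idx).size
precision = hits / k
recall = hits / feature_idx.size

print(f"degradation score: {degradation:.3f}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

The paper's point is that these two numbers need not agree: a class whose features differ in amplitude or temporal extent can score well on one measure and poorly on the other.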

Takeaways, Limitations

Takeaways:
  • Clearly demonstrates the limitations of evaluating XAI methods with perturbation-based metrics alone.
  • Experimentally establishes the existence of class-dependent evaluation effects and analyzes their causes.
  • Raises questions about the reliability of existing evaluation methods and underscores the need to develop new ones.
  • Suggests that the development and evaluation of XAI methods require a more rigorous, multidimensional evaluation approach.
Limitations:
  • The use of synthetic data limits generalizability to real-world datasets.
  • Additional experiments with other types of XAI methods and datasets are needed.
  • No concrete proposal for new evaluation methods is given.