This paper addresses the challenge of evaluating feature attribution methods in explainable AI (XAI). In the absence of ground truth, researchers typically rely on perturbation-based metrics, yet recent studies have shown that such metrics can yield systematically different scores across predicted classes within the same dataset. This “class-dependent evaluation effect” raises the question of whether perturbation analysis reliably measures attribution quality, with direct implications for the reliability of XAI method development and evaluation. In this paper, we investigate under what conditions such class-dependent effects occur through controlled experiments on synthetic time-series data with known ground-truth feature locations. We systematically vary feature types and class contrasts in a binary classification task and compare perturbation-based degradation scores with ground-truth-based precision-recall metrics across multiple attribution methods. The results show that class-dependent effects arise in both evaluation approaches even in simple scenarios with temporally localized features, driven by basic differences in feature amplitude or temporal extent between the classes. Most importantly, perturbation-based metrics and ground-truth metrics often produce conflicting estimates of attribution quality across classes, and the correlation between the two evaluation approaches is weak. These results suggest that researchers should interpret perturbation-based metrics cautiously, as they may not always reflect whether an attribution correctly identifies the class-distinguishing features. By demonstrating this discrepancy, the study highlights the need to reconsider what attribution evaluations actually measure and to develop more rigorous evaluation methods that capture multiple dimensions of attribution quality.
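To make the two evaluation families concrete, the following Python sketch contrasts a perturbation-based degradation score with a ground-truth precision-recall score for an attribution map over a univariate time series. It is an illustrative sketch only, not the paper's actual protocol: the function names, the mean-substitution perturbation, the top-k thresholding, and the toy model are all assumptions made for exposition.

    import numpy as np

    def degradation_score(model_prob, x, attribution, k=10):
        """Perturbation-based score: drop in predicted-class probability after
        masking the k time steps with the highest attribution (mean substitution;
        an assumed perturbation choice)."""
        top_k = np.argsort(attribution)[::-1][:k]
        x_perturbed = x.copy()
        x_perturbed[top_k] = x.mean()
        return model_prob(x) - model_prob(x_perturbed)  # larger drop = "better" attribution

    def precision_recall(attribution, gt_mask, k=10):
        """Ground-truth score: precision and recall of the top-k attributed time
        steps against the known informative region."""
        top_k = np.argsort(attribution)[::-1][:k]
        pred_mask = np.zeros_like(gt_mask, dtype=bool)
        pred_mask[top_k] = True
        tp = np.sum(pred_mask & gt_mask)
        precision = tp / max(pred_mask.sum(), 1)
        recall = tp / max(gt_mask.sum(), 1)
        return precision, recall

    # Toy example: a length-100 series whose class evidence sits in steps 40-49
    # (an assumed amplitude-shift feature).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    x[40:50] += 2.0
    gt_mask = np.zeros(100, dtype=bool)
    gt_mask[40:50] = True

    def model_prob(series):
        """Stand-in classifier: probability rises with the mean of the informative window."""
        return 1.0 / (1.0 + np.exp(-series[40:50].mean()))

    attribution = np.abs(x)  # placeholder attribution map
    print("degradation:", degradation_score(model_prob, x, attribution))
    print("precision/recall:", precision_recall(attribution, gt_mask))

In this setup, a high degradation score and a high precision-recall score need not coincide, which is the kind of disagreement between the two evaluation views that the paper examines.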