This paper presents a framework to address the "black box" nature of deep learning models, which limits their practical adoption in high-stakes applications such as dental evaluation. Using the performance gap observed in the automatic staging of mandibular second molars (tooth 37) and third molars (tooth 38) as a case study, we propose a framework that combines a convolutional autoencoder (AE) with a vision transformer (ViT). The framework improves classification accuracy for both teeth over the baseline ViT model, from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond these performance gains, analysis of the AE's latent-space metrics and image reconstructions reveals that the performance gap is data-driven, with the high intraclass morphological variability of the tooth 38 dataset being a key limitation. This analysis highlights the inadequacy of relying on a single interpretability method, such as attention maps, and the framework offers a powerful tool to support expert decision-making by improving accuracy and identifying sources of model uncertainty.
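To make the architecture concrete, the sketch below shows one plausible way to couple a convolutional AE with a ViT classifier in PyTorch. The coupling strategy (concatenating the AE's latent code with the ViT's class-token features), the `vit_b_16` backbone, the 128-dimensional latent space, and the eight developmental stages are all illustrative assumptions; the abstract does not specify the paper's actual configuration.

```python
# Minimal sketch of one possible AE + ViT coupling (an assumption: the
# abstract does not specify how the two models are combined). The AE's
# latent code is concatenated with the ViT's class-token features, so the
# AE can both regularize training (via reconstruction loss) and expose a
# latent space for the interpretability analysis described in the paper.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 224 -> 112
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 112 -> 56
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 56 -> 28
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 28 * 28), nn.ReLU(),
            nn.Unflatten(1, (64, 28, 28)),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent code used for fusion and analysis
        return self.decoder(z), z    # reconstruction + latent code

class AEViTClassifier(nn.Module):
    """Hypothetical fusion head: ViT class features + AE latent code."""
    def __init__(self, num_stages=8, latent_dim=128):
        super().__init__()
        self.ae = ConvAutoencoder(latent_dim)
        vit = vit_b_16(weights=None)     # ViT-B/16 expects 3-channel 224x224 input
        vit.heads = nn.Identity()        # keep the 768-d class-token representation
        self.vit = vit
        self.head = nn.Linear(768 + latent_dim, num_stages)

    def forward(self, x):                        # x: (B, 1, 224, 224) radiograph crop
        recon, z = self.ae(x)
        feats = self.vit(x.repeat(1, 3, 1, 1))   # replicate the gray channel for the ViT
        logits = self.head(torch.cat([feats, z], dim=1))
        return logits, recon                     # reconstruction loss can be added to the objective

model = AEViTClassifier()
x = torch.randn(2, 1, 224, 224)
logits, recon = model(x)
print(logits.shape, recon.shape)  # torch.Size([2, 8]) torch.Size([2, 1, 224, 224])
```

In such a design, the AE branch is what enables the latent-space and reconstruction-based analysis the abstract describes: staging classes that the AE reconstructs poorly or that overlap heavily in latent space flag data-driven sources of uncertainty, independent of the ViT's attention maps.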