This paper proposes a hybrid model combining a convolutional neural network (CNN) and a vision transformer (ViT) for interpretable medical image analysis. To address the interpretability challenges of existing hybrid models, we develop a fully convolutional CNN-transformer architecture that incorporates interpretability from the design stage. Applied to retinal disease detection, the model achieves predictive performance superior to existing black-box and interpretable models while generating class-specific sparse evidence maps in a single forward pass. We release our code as open source to ensure reproducibility.
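As a minimal sketch of how a fully convolutional classification head can yield class-specific evidence maps in the same forward pass that produces the logits, the snippet below uses a CAM-style weighted sum of feature channels followed by global average pooling. The function name, the weighted-sum formulation, and the use of NumPy are illustrative assumptions, not the paper's exact method; the sparsity of the maps (e.g., via an L1 penalty during training) is omitted here.

```python
import numpy as np

def class_evidence_maps(features, class_weights):
    """Per-class evidence maps from convolutional features (illustrative sketch).

    features:      (C, H, W) feature maps from the convolutional backbone.
    class_weights: (K, C) weights of a 1x1 convolutional classifier head,
                   one row per class.
    Returns (K, H, W) evidence maps and (K,) logits.
    """
    # Evidence map for class k: weighted sum over feature channels,
    # computed at every spatial location.
    maps = np.einsum('kc,chw->khw', class_weights, features)
    # Global average pooling turns each class map into a logit, so the
    # same forward pass that classifies also exposes spatial evidence.
    logits = maps.mean(axis=(1, 2))
    return maps, logits

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 7, 7))   # hypothetical backbone output
weights = rng.standard_normal((3, 8))       # hypothetical 3-class head
maps, logits = class_evidence_maps(features, weights)
print(maps.shape, logits.shape)  # (3, 7, 7) (3,)
```

Because the logits are just spatial averages of the evidence maps, the explanation is faithful to the prediction by construction rather than produced by a separate post-hoc attribution pass.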