This paper proposes an interpretable hybrid model for medical image analysis that combines the local feature extraction of CNNs with the ability of Vision Transformers (ViTs) to capture global dependencies. To overcome the interpretability challenges of existing hybrid models, we develop a fully convolutional CNN-Transformer architecture that incorporates interpretability from the design stage and apply it to retinal disease detection. The proposed model outperforms existing black-box and interpretable models in predictive performance and generates class-specific sparse evidence maps in a single forward pass. The code is available on GitHub.
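To make the general idea concrete, the sketch below illustrates one way such a hybrid could be wired: a CNN stem for local features, a Transformer encoder over the resulting spatial tokens for global dependencies, and a 1x1 convolutional head that emits one evidence map per class, with class logits obtained by spatially pooling those maps so prediction and explanation come from the same forward pass. This is a minimal illustrative sketch under assumed layer sizes and module names (HybridEvidenceNet, dim, num_classes are hypothetical), not the authors' exact architecture, and it omits the sparsity mechanism applied to the evidence maps.

```python
import torch
import torch.nn as nn


class HybridEvidenceNet(nn.Module):
    """Illustrative CNN-Transformer hybrid with per-class evidence maps."""

    def __init__(self, num_classes: int, dim: int = 256):
        super().__init__()
        # CNN stem: local feature extraction (sizes are illustrative assumptions)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder: global dependencies across spatial tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # 1x1 conv head: one spatial evidence map per class
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):
        f = self.cnn(x)                          # (B, dim, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)    # (B, H*W, dim)
        tokens = self.transformer(tokens)        # global mixing of tokens
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        evidence = self.head(f)                  # (B, num_classes, H, W)
        logits = evidence.mean(dim=(2, 3))       # spatial pooling -> class scores
        return logits, evidence


# Usage: class predictions and class-specific evidence maps in one pass
model = HybridEvidenceNet(num_classes=4)
logits, maps = model(torch.randn(1, 3, 224, 224))
```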