This paper presents the Explainable Vision Mamba (EVM-Fusion) architecture, designed to improve the accuracy, interpretability, and generalizability of medical image classification. EVM-Fusion employs a multipath design that combines DenseNet- and U-Net-based paths, each enhanced by a Vision Mamba (Vim) module. Features from these paths are dynamically integrated through a two-stage fusion process comprising cross-modal attention and an iterative Neural Algorithm Fusion (NAF) block. Explainability is intrinsic to the design, provided by path-specific spatial attention, Vim Δ-value maps, SE-attention over the original features, and cross-modal attention weights. Experimental results on a diverse nine-class, multi-institutional medical image dataset demonstrate robust classification performance, with 99.75% test accuracy, highlighting the potential of EVM-Fusion for reliable AI-assisted medical diagnosis.
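To make the two-stage fusion concrete, the sketch below shows one plausible way such a head could be wired up in PyTorch: per-path feature vectors are first exchanged through cross-modal attention, then concatenated and refined by an iterative residual block before classification. This is an illustrative sketch under stated assumptions, not the authors' implementation; the module names (CrossModalAttention, NAFBlock, FusionHead), the 256-dimensional path features, the number of refinement steps, and the internal layer choices are all hypothetical.

```python
# Illustrative sketch of a two-stage fusion head (hypothetical, not the paper's code).
# Assumes each of three feature paths emits a 256-dim vector per image.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Lets each path's feature vector attend to the other paths' features."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_paths, dim). The attention weights could also be
        # read out as an explainability signal (which path informs which).
        out, _ = self.attn(feats, feats, feats)
        return out + feats  # residual connection


class NAFBlock(nn.Module):
    """Iteratively refines the fused representation for a fixed number of steps."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.update = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.steps):
            x = x + self.update(x)  # iterative residual refinement
        return x


class FusionHead(nn.Module):
    """Cross-modal attention followed by iterative fusion and a linear classifier."""

    def __init__(self, dim: int = 256, num_paths: int = 3, num_classes: int = 9):
        super().__init__()
        self.cross_attn = CrossModalAttention(dim)
        self.naf = NAFBlock(dim * num_paths)
        self.classifier = nn.Linear(dim * num_paths, num_classes)

    def forward(self, path_feats: torch.Tensor) -> torch.Tensor:
        # path_feats: (batch, num_paths, dim), one feature vector per path
        attended = self.cross_attn(path_feats)
        fused = attended.flatten(start_dim=1)  # concatenate path features
        fused = self.naf(fused)                # iterative fusion refinement
        return self.classifier(fused)          # nine-class logits


# Usage: three path features of dimension 256 for a batch of two images.
logits = FusionHead()(torch.randn(2, 3, 256))
print(logits.shape)  # torch.Size([2, 9])
```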