This paper proposes a novel framework for detecting Mild Cognitive Impairment (MCI) from image descriptions in multilingual, multi-image settings. Whereas previous studies focused primarily on single-image descriptions from English speakers, this paper considers multilingual speakers and multiple images, and it presents three components: supervised contrastive learning to learn more discriminative representations, image modality integration, and a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Compared with existing text-only unimodal benchmarks, the proposed framework improves Unweighted Average Recall (UAR) by 7.1 percentage points (from 68.1% to 75.2%) and F1 score by 2.9 percentage points (from 80.6% to 83.5%). Furthermore, the contrastive learning component yields larger performance gains for text than for speech.
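For readers unfamiliar with the headline metric, the following is a minimal sketch of how UAR is conventionally computed: the recall of each class is calculated separately and the results are averaged with equal weight, so a majority class cannot dominate the score. The function name `unweighted_average_recall` and the toy labels are illustrative assumptions, not the paper's evaluation code or data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls, weighting every class equally."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (1 = MCI, 0 = control); not data from the paper.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# recall(MCI) = 3/4 and recall(control) = 1/2, so UAR = 0.625,
# whereas plain accuracy on the same predictions is 4/6.
uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

Because each class contributes equally, UAR is a common choice when the MCI and control groups differ in size, which may be why it is reported alongside F1 here.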