This paper proposes a semi-supervised multimodal medical image classification method built on a "pre-training + fine-tuning" framework, addressing the problem of modality fusion when expert-annotated data are scarce. In the pre-training stage, a synergistic learning framework integrates consistency, reconstruction, and alignment learning, treating one modality as an augmented sample of another so that self-supervised learning can enhance the feature representation capability of the base model. We then design a fine-tuning method for multimodal fusion that combines modality-specific feature extractors with multimodal fusion feature extractors. To mitigate the prediction uncertainty and overfitting risks caused by the lack of labeled data, we further propose a distribution shift method for the multimodal fusion features. Experimental results on the Kvasir and Kvasirv2 gastrointestinal endoscopy image datasets demonstrate that the proposed method outperforms existing state-of-the-art classification methods. The source code will be made available on GitHub.
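
As a rough illustration of the cross-modal consistency idea summarized above (one modality treated as an augmented sample of another during self-supervised pre-training), the sketch below pulls together the features of paired modalities with a cosine-based consistency loss. The encoder architecture, feature dimension, and loss form are assumptions made for illustration only, not the paper's actual implementation.

```python
# Minimal sketch of cross-modal consistency pre-training (illustrative only;
# not the authors' implementation). Paired images from two modalities of the
# same case are encoded separately, and their features are encouraged to agree.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy per-modality encoder producing a fixed-size feature vector (assumed architecture)."""
    def __init__(self, in_channels: int = 3, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x).flatten(1)
        return self.proj(h)

def cross_modal_consistency_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Encourage paired features from two modalities to agree (cosine similarity)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    return (1.0 - (z_a * z_b).sum(dim=-1)).mean()

if __name__ == "__main__":
    enc_a, enc_b = ModalityEncoder(), ModalityEncoder()
    # Unlabeled paired images from two modalities of the same cases (random stand-ins here).
    x_a, x_b = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
    loss = cross_modal_consistency_loss(enc_a(x_a), enc_b(x_b))
    loss.backward()
    print(float(loss))
```

In the full method described in the abstract, such a consistency term would be combined with reconstruction and alignment objectives during pre-training; the sketch shows only the consistency component.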