This paper presents a method for integrating features from multiple image modalities for medical image analysis. Existing CLIP-based approaches require paired data across modalities, which is difficult to obtain for medical images. To address this, we propose a novel pre-training framework, Multimodal Medical Image Binding with Text (M³Bind). M³Bind seamlessly aligns multiple modalities through a shared text representation space without requiring explicit paired data between different medical image modalities. Specifically, M³Bind fine-tunes pre-trained CLIP-like image-text models so that their modality-specific text embedding spaces become aligned, and then distills the modality-specific text encoders into a unified model that provides a shared text embedding space. Experimental results on X-ray, CT, retina, ECG, and pathology images demonstrate that M³Bind outperforms CLIP-like models on zero-shot and few-shot classification and on cross-modal retrieval tasks.
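To make the two-stage idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of (1) aligning the text embedding spaces of two modality-specific CLIP-like text encoders and (2) distilling them into a single unified text encoder. All module names, dimensions, and loss choices here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyTextEncoder(nn.Module):
    """Stand-in for a pre-trained CLIP-like text encoder (one per imaging modality)."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        # L2-normalized text embeddings, as is typical for CLIP-style models.
        return F.normalize(self.proj(self.embed(token_ids)), dim=-1)


def alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss pulling two text embedding spaces together
    on captions describing the same concept (a common choice; an assumption here)."""
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(emb_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def distillation_loss(student_emb, teacher_emb):
    """Distill a modality-specific teacher text encoder into the unified student
    by matching embeddings (cosine distance used as a simple proxy)."""
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()


if __name__ == "__main__":
    # Stage 1: fine-tune modality-specific text encoders so their spaces align.
    xray_text_enc, ct_text_enc = TinyTextEncoder(), TinyTextEncoder()
    tokens = torch.randint(0, 1000, (8, 16))  # dummy tokenized captions
    loss_align = alignment_loss(xray_text_enc(tokens), ct_text_enc(tokens))

    # Stage 2: distill the aligned teachers into one unified text encoder
    # that defines the shared text embedding space.
    unified_text_enc = TinyTextEncoder()
    loss_distill = (distillation_loss(unified_text_enc(tokens), xray_text_enc(tokens).detach())
                    + distillation_loss(unified_text_enc(tokens), ct_text_enc(tokens).detach()))

    (loss_align + loss_distill).backward()
    print(f"alignment loss: {loss_align.item():.3f}, distillation loss: {loss_distill.item():.3f}")
```

In this sketch the two stages are combined in one step only for brevity; the abstract describes them as sequential (align first, then distill), and the specific losses used by M³Bind may differ.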