MobileCLIP is a family of efficient image-text models that achieves state-of-the-art zero-shot accuracy at low latencies of 3-15 ms with 50-150 million parameters. In this paper, we present MobileCLIP2, which improves on MobileCLIP's multi-modal reinforced training. The improvements include a stronger CLIP teacher ensemble trained on the DFN dataset and a stronger caption-generator teacher fine-tuned on a diverse selection of high-quality image-caption datasets. We experimentally demonstrate the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of fine-tuning the caption generator for caption diversity, and further gains from combining synthetic captions generated by multiple models. As a result, MobileCLIP2 achieves state-of-the-art zero-shot accuracy on ImageNet-1k at low latencies; in particular, MobileCLIP2-B improves over MobileCLIP-B by 2.2% in accuracy. MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 while being 2x smaller, and has 2.5x lower latency than DFN ViT-L/14. Our trained models and data generation code are publicly available.
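The contrastive knowledge distillation objective whose temperature the abstract highlights can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names are placeholders, and the choice of a KL divergence between teacher and student image-to-text similarity distributions is an assumption about how such distillation is commonly set up.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_distill_loss(student_img, student_txt,
                             teacher_img, teacher_txt, tau=0.1):
    """KL(teacher || student) over image-to-text similarity rows.

    Embeddings are assumed L2-normalized; row i of the image and text
    batches form a matched pair. `tau` is the distillation temperature
    (an illustrative hyperparameter, not a value from the paper).
    """
    t_logits = teacher_img @ teacher_txt.T / tau  # (B, B) teacher similarities
    s_logits = student_img @ student_txt.T / tau  # (B, B) student similarities
    p_t = np.exp(log_softmax(t_logits))           # teacher's soft targets per image
    # Per-image KL divergence, averaged over the batch.
    kl = (p_t * (log_softmax(t_logits) - log_softmax(s_logits))).sum(axis=-1)
    return float(kl.mean())
```

Lowering `tau` sharpens the teacher distribution toward the hard one-hot contrastive targets, while a higher `tau` lets the student match more of the teacher's soft similarity structure; this trade-off is why the temperature matters enough to tune.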