To address the inability of ViT-based large multimodal models (LMMs) to capture subtle visual differences in geometric scenarios, this paper proposes a hard-negative contrastive learning framework. The framework combines image-based contrastive learning, which uses generative hard negatives produced by modifying diagram-generation code, with text-based contrastive learning, which uses rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected by caption similarity. Using this hard-negative training method, we train a CLIP visual encoder, Multimodal Math CLIP (MMCLIP), which in turn is used to train an LMM for solving geometric problems. Experimental results show that the 7B MMGeoLM model significantly outperforms other open-source models on three geometric reasoning benchmarks, achieving performance comparable to strong closed-source models such as GPT-4o. Additionally, through analysis of hard-negative types, the efficiency of image-based negatives, and training configurations, we gain insights into optimizing the visual-encoder training pipeline for fine-grained geometric reasoning tasks.
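To make the core training objective concrete, below is a minimal sketch of how per-sample hard negatives can be folded into an InfoNCE-style contrastive loss alongside the usual in-batch negatives. The function name, the per-image grouping of hard negatives, and the exact loss form are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def hard_negative_info_nce(img_emb, txt_emb, hard_neg_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss with extra hard negatives (assumed form).

    img_emb:      (B, D) image embeddings
    txt_emb:      (B, D) matching caption embeddings (positives)
    hard_neg_emb: (B, K, D) K hard-negative embeddings per image, e.g. captions
                  from rule-modified geometric descriptions, retrieved similar
                  captions, or renders of perturbed diagram code
    """
    # Normalize so dot products are cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: every other caption in the batch is a negative. (B, B)
    logits = img @ txt.t()

    # Hard-negative similarities: each image vs. its own K hard negatives. (B, K)
    hard_logits = torch.einsum("bd,bkd->bk", img, neg)

    # Candidate set per image: B in-batch captions + K dedicated hard negatives.
    logits = torch.cat([logits, hard_logits], dim=1) / temperature

    # The positive for image i is caption i (index i in the candidate list).
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)
```

The key design point this sketch captures is that hard negatives are appended to the softmax candidate set per image, so the encoder is penalized specifically for assigning high similarity to near-miss geometric descriptions or diagrams rather than only to random in-batch mismatches.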