This paper focuses on extending the Mixture-of-Experts (MoE) architecture, an important approach for scaling large language models, to multimodal tasks. Existing methods for building multimodal MoE models either incur high training costs or suffer degraded language capabilities when adapting pre-trained models. To address these issues, we propose a novel regularization technique, Soft Modality-Aware Routing (SMAR), which encourages expert specialization without modifying the model architecture or relying heavily on text data. SMAR controls the routing probability distributions across modalities using Kullback-Leibler divergence. Experimental results on visual instruction tuning show that SMAR outperforms the baseline model, retaining 86.6% of language capability with only 2.5% pure text data while maintaining strong multimodal performance. This study provides a practical and efficient solution for balancing modality differentiation and language capability in multimodal MoE models.
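As an illustrative sketch only (not the paper's exact formulation), the following PyTorch-style snippet shows one way a KL-based modality-aware routing regularizer could be computed: router probabilities are averaged per modality and a symmetric KL divergence between the two modality-level distributions is used as a soft signal for expert specialization. The function name, the sign of the term, and the aggregation over tokens are assumptions for illustration.

```python
import torch


def smar_style_regularizer(router_logits: torch.Tensor,
                           image_token_mask: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical KL-based modality-aware routing regularizer (illustrative only).

    router_logits: (num_tokens, num_experts) gating logits from one MoE layer.
    image_token_mask: (num_tokens,) bool tensor, True for image tokens, False for text tokens.
    Returns a scalar regularization term to be scaled and added to the training loss.
    """
    probs = router_logits.softmax(dim=-1)                          # per-token routing distributions
    p_img = probs[image_token_mask].mean(dim=0).clamp_min(eps)     # mean routing dist. of image tokens
    p_txt = probs[~image_token_mask].mean(dim=0).clamp_min(eps)    # mean routing dist. of text tokens

    # Symmetric KL divergence between the modality-level routing distributions.
    kl_sym = (p_img * (p_img / p_txt).log()).sum() + (p_txt * (p_txt / p_img).log()).sum()

    # Negating the divergence nudges the router toward modality-separated ("specialized")
    # expert usage; the exact sign, weighting, and form in the paper may differ.
    return -kl_sym
```

In a training loop, such a term would typically be added to the task loss with a small coefficient, e.g. `loss = lm_loss + lambda_smar * smar_style_regularizer(router_logits, image_token_mask)`, where `lambda_smar` is a hypothetical hyperparameter.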