Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities

Created by
  • Haebom

Author

Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

Outline

This paper extends the Mixture-of-Experts (MoE) architecture, a key approach to scaling large language models, to multimodal tasks. Existing methods for building multimodal MoE models either incur high training costs or degrade the language capabilities of the pre-trained models they adapt. To address these issues, the authors propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that encourages expert specialization without modifying the model architecture or relying heavily on text data. SMAR uses Kullback-Leibler divergence to control how routing probability distributions differ across modalities. In visual instruction tuning experiments, SMAR retains 86.6% of language capability with only 2.5% pure-text data, while outperforming baselines and maintaining strong multimodal performance. The study offers a practical and efficient way to balance modality differentiation and language capability in multimodal MoE models.
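
The summary above names only the mechanism (a KL-divergence regularizer on routing distributions across modalities), not the exact loss, so the sketch below is an illustrative reconstruction rather than the paper's formulation. The function name smar_regularizer, the modality_mask interface, and the choice of a symmetric KL maximized between the two modalities' mean routing distributions are all assumptions made for the example.

```python
import torch

def smar_regularizer(router_logits: torch.Tensor,
                     modality_mask: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical sketch of a soft modality-aware routing penalty.

    router_logits : (num_tokens, num_experts) raw gate scores from the MoE router.
    modality_mask : (num_tokens,) bool; True for image tokens, False for text tokens.

    Returns a scalar: the symmetric KL divergence between the two modalities'
    mean routing distributions. Subtracting a weighted copy of this term from
    the task loss would encourage the router to specialize experts by modality
    (sign and weighting are assumptions, not the paper's stated recipe).
    """
    probs = router_logits.softmax(dim=-1)              # (T, E) routing probabilities
    p_img = probs[modality_mask].mean(dim=0) + eps     # mean routing over image tokens
    p_txt = probs[~modality_mask].mean(dim=0) + eps    # mean routing over text tokens
    p_img = p_img / p_img.sum()                        # renormalize after smoothing
    p_txt = p_txt / p_txt.sum()
    # Symmetric KL between the two modality-level expert distributions.
    kl = (p_img * (p_img / p_txt).log()).sum() \
       + (p_txt * (p_txt / p_img).log()).sum()
    return kl

# Usage sketch (lambda_smar is a hypothetical regularization weight):
#   total_loss = task_loss - lambda_smar * smar_regularizer(logits, mask)
```

A soft penalty like this leaves the architecture untouched: only the training objective changes, which matches the summary's claim that SMAR works without architectural modification.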

Takeaways, Limitations

Takeaways:
SMAR is presented as an efficient regularization technique that reduces the training cost of multimodal MoE models and mitigates language-capability degradation.
Strong multimodal performance and language ability can be maintained simultaneously with only a small amount of pure-text data.
Improves the practicality of building and deploying multimodal MoE models.
Limitations:
SMAR's performance is demonstrated on a specific visual instruction tuning setting; generalization to other tasks and datasets requires further study.
Optimal hyperparameter settings for the KL-divergence-based routing control may need further investigation.
Validation on other modality combinations (e.g., text, images, audio) is needed.