Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the authors and their institutions. When sharing, please cite the source.

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Created by
  • Haebom

Author

Feilong Chen, Yijiang Liu, Yi Huang, Hao Wang, Miren Tian, Ya-Qi Yu, Minghui Liao, Jihao Wu

Outline

We propose MindVL, a multimodal large language model (MLLM) trained on Ascend NPUs. MindVL aims to overcome the dependence on closed data recipes and proprietary hardware platforms that hinders open research and reproducibility. An efficient training framework, MindSpeed-MLLM, enables stable, high-performance training of large dense and Mixture-of-Experts (MoE) models on Ascend hardware. The authors also provide a systematic, open description of their data preparation methods and mixing strategy. MindVL is a data-efficient MLLM trained end-to-end on Ascend NPUs, and its performance is further improved by averaging weights across checkpoints trained with different sequence lengths and by test-time resolution search. MindVL-8B matches the performance of Qwen2.5VL-7B using only 10% of its data, and the MoE model MindVL-671B-A37B matches Qwen2.5VL-72B using only 3% of its data.
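The checkpoint-averaging idea above can be sketched in a few lines. The paper does not specify its exact averaging scheme, so this is a minimal uniform-average sketch (a simple "model soup"); the dictionary-of-lists representation stands in for real framework state dicts, and the checkpoint names are hypothetical.

```python
def average_checkpoints(state_dicts):
    """Uniformly average parameters across checkpoints.

    Each state dict maps a parameter name to a flat list of weights;
    all checkpoints are assumed to share the same architecture, so
    every dict has the same keys and shapes.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # zip aligns the same parameter position across all checkpoints
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(vals) / n for vals in columns]
    return averaged

# Toy checkpoints standing in for runs with different training sequence lengths
ckpt_short_seq = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_long_seq = {"w": [3.0, 4.0], "b": [2.0]}
soup = average_checkpoints([ckpt_short_seq, ckpt_long_seq])
# soup == {"w": [2.0, 3.0], "b": [1.0]}
```

In practice the same element-wise average would be applied to real tensors (e.g. PyTorch state dicts), but the arithmetic is identical.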

Takeaways, Limitations

Takeaways:
Demonstrating that Ascend hardware is well suited for MLLM training, countering the perception that it is not.
Improving reproducibility and research accessibility by providing open data recipes.
Developing a data-efficient MLLM (MindVL).
Presenting performance-enhancement techniques: weight averaging across checkpoints trained with different sequence lengths, and test-time resolution search.
Achieving competitive performance compared to other leading MLLMs.
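Test-time resolution search, mentioned above, amounts to evaluating the trained model at several candidate input resolutions on a held-out set and keeping the best-scoring one. The paper does not detail its search procedure, so this is a minimal sketch; `evaluate_fn` and the candidate resolutions are hypothetical placeholders for a real validation run.

```python
def search_best_resolution(evaluate_fn, resolutions):
    """Pick the input resolution that maximizes a validation metric.

    `evaluate_fn(resolution)` is a stand-in callback that would run the
    model on a held-out set at the given resolution and return a score.
    """
    scores = {r: evaluate_fn(r) for r in resolutions}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for a real evaluation: the score happens to peak at 1024 px
best, scores = search_best_resolution(
    lambda r: 1.0 - abs(r - 1024) / 1024,
    [512, 768, 1024, 1280],
)
# best == 1024
```

Because only inference is involved, this search is cheap relative to training and can be rerun per benchmark.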
Limitations:
The summary may lack detailed information about the specific data volume or model architecture (a limitation of working from the abstract).
The generalizability of the results may require further validation.
Information may be lacking for a comparative analysis against other hardware platforms.