Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MiniCPM4: Ultra-Efficient LLMs on End Devices

Created by
  • Haebom

Author

MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Qiuzuo Li, Siyuan Li, Wenhao Li, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

Outline

MiniCPM4 is a highly efficient large language model (LLM) designed to run on end devices. Its efficiency comes from innovations in four areas: model architecture (InfLLM v2), training data (UltraClean, UltraChat v2), training algorithms (ModelTunnel v2, chunk-wise rollout, BitCPM), and the inference system (CPM.cu). InfLLM v2 is a trainable sparse attention mechanism that accelerates both the prefilling and decoding stages of long-context processing. UltraClean and UltraChat v2 provide efficient and accurate pre-training data filtering and generation strategies together with a comprehensive supervised fine-tuning dataset; using these data, satisfactory model performance is achieved with only 8 trillion training tokens. ModelTunnel v2 enables efficient pre-training strategy search, while chunk-wise rollout and BitCPM improve on existing post-training methods. CPM.cu integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To accommodate diverse device requirements, MiniCPM4 is released in two versions with 0.5B and 8B parameters, and MiniCPM4.1 is a hybrid reasoning model usable in both deep reasoning and non-reasoning modes. Evaluations show that MiniCPM4 and MiniCPM4.1 outperform similarly sized open-source models on benchmarks, with the 8B version in particular demonstrating significant speedups in long-sequence understanding and generation.
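To make the sparse-attention idea more concrete, below is a minimal PyTorch sketch in the spirit of InfLLM v2 as summarized above: each query attends only to the top-k most relevant blocks of the key/value cache instead of the full context. The function name block_sparse_attention, the mean-pooled block representation used for scoring, and the block_size/top_k values are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the official InfLLM v2 code): block-sparse attention where
# each query only attends to its top-k most relevant KV blocks. This is what makes
# long-context prefilling and decoding cheaper than dense attention.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """q: (heads, q_len, d); k, v: (heads, kv_len, d). Returns (heads, q_len, d)."""
    h, q_len, d = q.shape
    kv_len = k.shape[1]
    n_blocks = (kv_len + block_size - 1) // block_size

    # Pad the KV cache so it splits evenly into blocks.
    pad = n_blocks * block_size - kv_len
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(h, n_blocks, block_size, d)
    v_blocks = F.pad(v, (0, 0, 0, pad)).view(h, n_blocks, block_size, d)

    # Score each block by its mean key vector (a cheap relevance proxy),
    # then keep only the top-k blocks per query.
    block_repr = k_blocks.mean(dim=2)                                   # (h, n_blocks, d)
    block_scores = torch.einsum("hqd,hbd->hqb", q, block_repr)
    top_idx = block_scores.topk(min(top_k, n_blocks), dim=-1).indices   # (h, q_len, top_k)

    # Gather the selected blocks and run dense attention only over them.
    head_idx = torch.arange(h)[:, None, None]
    sel_k = k_blocks[head_idx, top_idx].flatten(2, 3)                   # (h, q_len, top_k*block_size, d)
    sel_v = v_blocks[head_idx, top_idx].flatten(2, 3)

    attn = torch.einsum("hqd,hqkd->hqk", q, sel_k) / d ** 0.5
    weights = attn.softmax(dim=-1)
    return torch.einsum("hqk,hqkd->hqd", weights, sel_v)

# Example: 8 heads, 16 queries attending over a 4096-token cache.
q = torch.randn(8, 16, 64)
k = torch.randn(8, 4096, 64)
v = torch.randn(8, 4096, 64)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([8, 16, 64])
```

Because only top_k * block_size keys enter the softmax for each query, the per-query cost stays roughly constant as the context grows, which is where the long-context speedups described above come from.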

Takeaways, Limitations

Takeaways:
  • Demonstrates that large language models can run efficiently on end devices.
  • Presents a novel architecture and algorithms that accelerate long-context processing.
  • Reduces the amount of training data required through efficient data filtering and generation strategies.
  • Offers multiple model versions to meet diverse device requirements.
  • Shows superior performance and speed compared to similarly sized open-source models.
Limitations:
  • The performance and efficiency of the hybrid reasoning model MiniCPM4.1 are not analyzed in detail.
  • Further research is needed to determine how well the proposed techniques generalize.
  • A more comprehensive comparative analysis against other LLMs is needed.
  • A training budget of 8 trillion tokens is still substantial; further work is needed on maintaining performance with even less data.