Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Created by
  • Haebom

Authors

V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Outline

GLM-4.1V-Thinking and GLM-4.5V are vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. The paper shares key findings from the development of a reasoning-centric training framework. The authors first built a capable vision foundation model through large-scale pretraining, then applied Reinforcement Learning with Curriculum Sampling (RLCS) to improve performance across a wide range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long-document understanding. In a comprehensive evaluation on 42 public benchmarks, GLM-4.5V achieves state-of-the-art results on nearly all tasks among open-source models of similar size, and is competitive with or better than closed-source models such as Gemini-2.5-Flash on challenging tasks like coding and GUI agents. The smaller GLM-4.1V-9B-Thinking remains highly competitive, outperforming the much larger Qwen2.5-VL-72B on 29 benchmarks. Both GLM-4.1V-9B-Thinking and GLM-4.5V are open-sourced.
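The summary does not spell out how RLCS works, so here is a minimal sketch of the general idea behind curriculum sampling for RL: preferentially sample tasks near the model's current competence frontier, where outcome-based rewards carry the most learning signal. The task structure, the success-rate proxy, and the 4·p·(1−p) weighting are illustrative assumptions, not the paper's actual algorithm.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    # Hypothetical difficulty proxy: rolling success rate of the
    # current policy on this task, in [0, 1].
    success_rate: float

def curriculum_weight(task: Task) -> float:
    """Weight tasks near the competence frontier highest.

    Tasks the model always solves (success_rate ~ 1) or never solves
    (success_rate ~ 0) yield little gradient signal under outcome-based
    RL, so intermediate difficulty is up-weighted. The 4*p*(1-p) shape
    is an illustrative choice, not the paper's scheme.
    """
    p = task.success_rate
    return 4.0 * p * (1.0 - p) + 1e-3  # small floor keeps every task reachable

def sample_batch(tasks: list[Task], batch_size: int) -> list[Task]:
    """Sample an RL training batch according to curriculum weights."""
    weights = [curriculum_weight(t) for t in tasks]
    return random.choices(tasks, weights=weights, k=batch_size)

# Example: tasks with success rates near 0.5 dominate the batch.
pool = [Task(f"problem-{i}", success_rate=i / 10) for i in range(11)]
for t in sample_batch(pool, batch_size=8):
    print(t.prompt, t.success_rate)
```

In a real pipeline the success rates would be re-estimated from fresh rollouts as training progresses, so the effective curriculum drifts toward harder tasks over time.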

Takeaways, Limitations

Takeaways:
Demonstrates the effectiveness of a reasoning-centric training framework that combines large-scale pretraining with RLCS.
Provides open-source VLMs with competitive performance across a wide variety of tasks.
GLM-4.5V is the highest-performing open-source model of its size and even outperforms closed-source models on some tasks.
GLM-4.1V-9B-Thinking outperforms much larger models such as Qwen2.5-VL-72B.
Contributes to research and development by open-sourcing the models and code (see the loading sketch below).
Limitations:
Specific limitations are not explicitly discussed in the paper; there is room for improvement through future research.
Performance differences on specific benchmarks may stem from differences in model architecture or training data and require further analysis.
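Since both models are open-sourced, they can be tried directly. Below is a minimal loading sketch using the Hugging Face transformers image-text-to-text pipeline; the repository id, the example URL, and pipeline support for this architecture are assumptions (check the official model cards for the exact ids and required transformers version), not details confirmed by the summary above.

```python
# Minimal sketch: querying an open GLM VLM checkpoint via Hugging Face
# transformers. The repo id below is hypothetical -- verify it on the
# Hugging Face Hub -- and a recent transformers release is required.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed repo id; check the model card
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/diagram.png"},  # placeholder image
            {"type": "text", "text": "Describe what this diagram shows."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"])
```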