Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Created by
  • Haebom

Author

V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Outline

GLM-4.1V-Thinking and GLM-4.5V are vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. This paper shares key findings from the development of a reasoning-centric training framework. After building a capable vision foundation model with significant potential through large-scale pretraining, we propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the model's full potential across a wide range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. In a comprehensive evaluation on 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among similarly sized open-source models, and is competitive with or superior to closed-source models such as Gemini-2.5-Flash on challenging tasks such as coding and GUI agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, outperforming the much larger Qwen2.5-VL-72B on 29 benchmarks. Both GLM-4.1V-9B-Thinking and GLM-4.5V are open-sourced.
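
The paper's RLCS recipe is not reproduced here; the following is a minimal, hypothetical Python sketch of the general idea behind curriculum sampling for RL rollouts, assuming a difficulty-aware sampler that up-weights tasks near the model's current ability. The names Task, curriculum_weight, and the weighting scheme are illustrative assumptions, not taken from the paper.

import random
from dataclasses import dataclass

# Sketch: prefer tasks the model solves only sometimes (informative reward signal),
# and down-weight tasks it always solves (too easy) or never solves (too hard).

@dataclass
class Task:
    prompt: str
    solve_rate: float  # running estimate of the model's success rate on this task

def curriculum_weight(solve_rate: float) -> float:
    # Peaks at solve_rate = 0.5, vanishes at 0.0 and 1.0.
    return solve_rate * (1.0 - solve_rate)

def sample_batch(tasks: list[Task], batch_size: int) -> list[Task]:
    weights = [curriculum_weight(t.solve_rate) + 1e-6 for t in tasks]
    return random.choices(tasks, weights=weights, k=batch_size)

def update_solve_rate(task: Task, reward: float, momentum: float = 0.9) -> None:
    # Exponential moving average of observed rewards (1.0 = solved, 0.0 = failed).
    task.solve_rate = momentum * task.solve_rate + (1.0 - momentum) * reward

if __name__ == "__main__":
    pool = [Task(f"problem-{i}", solve_rate=random.random()) for i in range(100)]
    batch = sample_batch(pool, batch_size=8)
    for t in batch:
        reward = random.random() < t.solve_rate  # stand-in for an RL rollout plus verifier
        update_solve_rate(t, float(reward))
    print([round(t.solve_rate, 2) for t in batch])

In this toy version the sampling distribution shifts automatically as the model improves: tasks drift toward a solve rate of 1.0 and lose weight, while previously too-hard tasks gain weight as they become partially solvable.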

Takeaways, Limitations

Takeaways:
Demonstrates the effectiveness of a reasoning-centric training framework that combines large-scale pretraining with RLCS.
Provides open-source VLMs that perform well across a wide variety of tasks.
GLM-4.5V achieves state-of-the-art performance among similarly sized open-source models and outperforms closed-source models on some tasks.
GLM-4.1V-9B-Thinking demonstrates efficiency by outperforming much larger models such as Qwen2.5-VL-72B.
Open-sourcing the models and code contributes to further research and development.
Limitations:
The paper does not explicitly discuss specific limitations. Further work is expected to bring improvements such as better performance on specific tasks, greater model scalability, and stronger generalization.