Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Created by
  • Haebom

Author

V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Outline

GLM-4.1V-Thinking and GLM-4.5V are vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. The paper shares key findings from the development of a reasoning-centric training framework: large-scale pretraining is first used to build a vision foundation model with strong potential, and Reinforcement Learning with Curriculum Sampling (RLCS) is then applied to unlock the model's capabilities across a wide range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long-document interpretation. In a comprehensive evaluation on 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all of them among open-source models of similar size, and is competitive with or superior to closed-source models such as Gemini-2.5-Flash on challenging tasks such as coding and GUI agents. The smaller GLM-4.1V-9B-Thinking also remains competitive, outperforming the much larger Qwen2.5-VL-72B on 29 benchmarks. Both GLM-4.1V-9B-Thinking and GLM-4.5V are released as open source.
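The exact RLCS procedure is described in the paper; as a rough, hypothetical illustration of the general idea behind curriculum sampling in RL training (not the authors' actual algorithm), the sketch below weights task pools by how "learnable" they currently are, favoring tasks the model neither always solves nor always fails. All class names, the success-rate tracking, and the weighting rule are assumptions made for illustration.

```python
import random

# Illustrative sketch only: curriculum sampling over RL task pools.
# The weighting rule (favor pools with success rates near 0.5) is an
# assumption for illustration, not the paper's exact RLCS procedure.

class CurriculumSampler:
    def __init__(self, task_pools):
        # task_pools: dict mapping pool name -> list of task instances
        self.task_pools = task_pools
        # Running success-rate estimate per pool, initialised to 0.5 (unknown).
        self.success_rate = {name: 0.5 for name in task_pools}

    def _weight(self, rate):
        # Highest weight when the model solves a pool about half the time:
        # such tasks are neither trivially easy nor hopelessly hard, so they
        # provide the strongest learning signal under an RL objective.
        return max(rate * (1.0 - rate), 1e-3)

    def sample(self):
        names = list(self.task_pools)
        weights = [self._weight(self.success_rate[n]) for n in names]
        pool = random.choices(names, weights=weights, k=1)[0]
        return pool, random.choice(self.task_pools[pool])

    def update(self, pool, solved, momentum=0.95):
        # Exponential moving average of rollout success for this pool.
        r = self.success_rate[pool]
        self.success_rate[pool] = momentum * r + (1.0 - momentum) * float(solved)


# Example with three hypothetical task pools of differing difficulty.
sampler = CurriculumSampler({
    "stem": ["q1", "q2"],
    "gui_agent": ["episode1", "episode2"],
    "grounding": ["img1", "img2"],
})
pool, task = sampler.sample()
sampler.update(pool, solved=True)
```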

Takeaways, Limitations

Takeaways:
Presents an effective VLM training framework that combines large-scale pre-training with RLCS.
Introduces the GLM-4.1V-Thinking and GLM-4.5V models, which show strong performance across a wide variety of tasks.
Shows that open-source models can be competitive with closed-source models.
Demonstrates strong performance relative to model size.
Limitations:
The paper does not explicitly discuss its own limitations or directions for future research.
Performance comparisons focus on specific benchmark tasks, so a deeper analysis of the models' generalization ability is still needed.
Although the models are open source, their complexity may still limit accessibility; a minimal loading sketch follows below.
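Since the 9B model is released openly, one low-effort way to try it is through Hugging Face transformers. The sketch below assumes the checkpoint is published on the Hub under an identifier like THUDM/GLM-4.1V-9B-Thinking and that the installed transformers version includes support for the model; the exact model ID, required version, and hardware needs should be checked against the official repository.

```python
# Minimal sketch for trying the open 9B checkpoint via Hugging Face transformers.
# The model identifier and required transformers version are assumptions;
# consult the official repository for the exact values.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",                # multimodal chat-style pipeline
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed Hub identifier
    device_map="auto",                   # place weights across available devices
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image URL
        {"type": "text", "text": "Describe the trend shown in this chart."},
    ],
}]

print(pipe(text=messages, max_new_tokens=256))
```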