Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Created by
  • Haebom

Authors

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi

Outline

UI-TARS-2 is an autonomous agent model for graphical user interfaces (GUIs). The report presents a systematic training methodology that addresses four challenges: data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. The methodology consists of a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates a file system and a terminal, and an integrated sandbox platform for large-scale deployment. Experimental results show that UI-TARS-2 improves significantly over its predecessor, UI-TARS-1.5, and achieves competitive performance across GUI benchmarks, game environments, information-seeking tasks, and software engineering benchmarks.

Takeaways, Limitations

Takeaways:
It provides insights into achieving stability and efficiency in large-scale RL for GUI agents.
It demonstrates strong generalization across a variety of agent tasks.
It advances GUI agents and shows that they can generalize to real-world interaction scenarios.
It outperforms existing models (Claude, OpenAI agents, etc.) on various GUI benchmarks, including Online-Mind2Web, OSWorld, WindowsAgentArena, and AndroidWorld.
It reaches approximately 60% of human-level performance in game environments, making it competitive with cutting-edge proprietary models.
It also generalizes to long-horizon information-seeking tasks and software engineering benchmarks.
Limitations:
The paper does not explicitly state specific limitations. Further improvement may be needed through future research.