This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
Outline
In this paper, we study how graphical user interface (GUI) agents interact with visual elements to perform tasks across platforms. A user's instruction is decomposed into a sequence of action proposals, each corresponding to one interaction with the GUI, and the agent plans its next step by observing the updated GUI environment after each action. We address two major challenges: resolving ambiguity in task planning and executing actions accurately on high-resolution interfaces. To this end, we present the GUI Test-time Scaling Agent (GTA1), which uses a test-time scaling method to select the best action proposal and uses reinforcement learning (RL) to improve the accuracy of action execution on visual elements. Experimental results show that GTA1 achieves state-of-the-art performance on a variety of benchmarks.
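The idea of selecting among several action proposals at test time can be illustrated with a generic best-of-N sketch. This is not the authors' implementation: the function names (`plan_next_action`, `make_toy_planner`, `toy_judge`), the number of candidates, and the scores are all hypothetical, and in a real agent both the proposer and the judge would be model calls that take the current screenshot and instruction as input.

```python
from typing import Callable, Iterable, List


def plan_next_action(
    propose: Callable[[], str],
    judge: Callable[[str], float],
    num_candidates: int = 4,
) -> str:
    """Best-of-N selection: sample several proposals, keep the top-scoring one.

    `propose` and `judge` are hypothetical stand-ins for a planner model
    and a scoring model; neither name comes from the paper.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # Draw independent candidate proposals for the current GUI state.
    candidates: List[str] = [propose() for _ in range(num_candidates)]
    # Keep the candidate the judge scores highest (first one wins ties).
    return max(candidates, key=judge)


def make_toy_planner(proposals: Iterable[str]) -> Callable[[], str]:
    """Return a deterministic 'planner' that replays a fixed list of proposals."""
    iterator = iter(proposals)
    return lambda: next(iterator)


def toy_judge(proposal: str) -> float:
    """Assumed scores for illustration only; unknown proposals score 0."""
    scores = {"click 'Save'": 0.9, "scroll down": 0.4, "click 'Cancel'": 0.1}
    return scores.get(proposal, 0.0)


if __name__ == "__main__":
    planner = make_toy_planner(["scroll down", "click 'Cancel'", "click 'Save'"])
    print(plan_next_action(planner, toy_judge, num_candidates=3))
```

Sampling more candidates gives the judge more options to choose from, at the price of one extra planner call per candidate, which is why the computational cost of test-time scaling is worth measuring.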
Takeaways, Limitations
• Takeaways:
◦ Test-time scaling effectively resolves ambiguity in task planning and reduces the number of steps needed to complete a task, improving overall performance.
◦ A visual grounding model trained with reinforcement learning executes actions accurately on high-resolution interfaces.
◦ State-of-the-art results on a variety of benchmarks demonstrate the potential of GUI agents.
◦ The code and models are publicly released to support reproducibility and further extension of the research.
• Limitations:
◦ Further study is needed to assess how well the proposed method generalizes.
◦ The method still needs to be tested in more complex and diverse GUI environments.
◦ The computational cost of test-time scaling may require additional analysis.