Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset

Created by
  • Haebom

Authors

Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen

Outline

This paper proposes NaturalGAIA, a novel benchmark built on the principle of causal pathways (CPAs), to address the limitations in accuracy, reproducibility, and scalability of existing evaluation benchmarks, which hinder the development of large language model (LLM)-based graphical user interface (GUI) agents. NaturalGAIA structures complex tasks into a series of programmatically verifiable, atomic steps, which yields rigorous, fully automated, and reproducible evaluation criteria. To mitigate the inherent functional limitations of agents, the authors also develop LightManus, a hierarchical agent architecture optimized for long-horizon tasks. This architecture is used to generate a high-quality, human-validated trajectory dataset that captures the diverse and self-correcting interaction patterns of LLMs. Using this dataset, they perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Experimental results show that NaturalGAIA poses significant challenges even to state-of-the-art LLMs: the best-performing model, Claude-sonnet-4, achieves a Weighted Pathway Success Rate (WPSR) of only 34.6%. RFT improved the GUI execution capability of the small model (WPSR rose from 3.3% to 10.8%), but its performance still degraded significantly in complex scenarios, which points to the inherent limitations of small models on comprehensive tasks that integrate perception, decision-making, and execution. The study contributes rigorous evaluation criteria and a high-quality dataset to guide the future development of GUI agents.
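The idea of splitting a task into programmatically verifiable atomic steps and then reporting a weighted success rate can be illustrated with a minimal sketch. This is not the paper's implementation or its WPSR formula, which the summary does not give. All names here (AtomicStep, Task, weight, verified_prefix, weighted_success_rate) are hypothetical, and the sketch assumes that a pathway counts as successful only when every step is verified in order, and that each task carries a weight chosen by the benchmark designer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical and chosen for illustration only;
# they are not taken from the paper.
State = Dict[str, object]


@dataclass
class AtomicStep:
    name: str
    verify: Callable[[State], bool]  # programmatic check on an observed state


@dataclass
class Task:
    name: str
    steps: List[AtomicStep]
    weight: float = 1.0  # assumed per-task weight, not the paper's definition


def verified_prefix(task: Task, states: List[State]) -> int:
    """Number of steps verified in order before the first failure."""
    count = 0
    for step, state in zip(task.steps, states):
        if not step.verify(state):
            break  # later steps depend on this one, so stop here
        count += 1
    return count


def path_succeeded(task: Task, states: List[State]) -> bool:
    """A pathway succeeds only if every atomic step is verified."""
    return verified_prefix(task, states) == len(task.steps)


def weighted_success_rate(results: List[Tuple[Task, bool]]) -> float:
    """Weight of successful tasks divided by the total weight."""
    total = sum(task.weight for task, _ in results)
    if total == 0:
        return 0.0
    return sum(task.weight for task, ok in results if ok) / total


# Toy example: one short task and one longer, more heavily weighted task.
open_app = AtomicStep("open_app", lambda s: s.get("app") == "notes")
type_text = AtomicStep("type_text", lambda s: s.get("text") == "hello")

short_task = Task("open only", [open_app], weight=1.0)
long_task = Task("open then type", [open_app, type_text], weight=3.0)

short_states = [{"app": "notes"}]
long_states = [{"app": "notes"}, {"app": "notes", "text": "helo"}]  # typo fails step 2

results = [
    (short_task, path_succeeded(short_task, short_states)),
    (long_task, path_succeeded(long_task, long_states)),
]
rate = weighted_success_rate(results)
```

In this toy run the short task passes and the long task fails at its second step, so the rate is 1 / (1 + 3) = 0.25. Weighting longer tasks more heavily is one plausible reading of why a weighted metric can stay low even when simple tasks are solved.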

Takeaways, Limitations

Takeaways:
Presents NaturalGAIA, a new rigorous and reproducible benchmark for evaluating LLM-based GUI agents.
Develops LightManus, a hierarchical agent architecture optimized for long-horizon tasks, and uses it to generate a high-quality dataset.
Demonstrates experimentally both the effectiveness and the limitations of RFT for improving the GUI execution capability of LLMs.
Provides a realistic assessment of the GUI capabilities of current state-of-the-art LLMs.
Limitations:
Further research is needed on the scalability and generalizability of the NaturalGAIA benchmark.
Further analysis is needed to understand why the effectiveness of RFT varies significantly with model size.
The benchmark needs to be expanded to cover more diverse and complex GUI tasks.
Generalizability of the LightManus architecture to other LLMs and tasks needs to be verified.