Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

Created by
  • Haebom

Author

Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhiheng Xi, Zhihui Cao, Hailiang Pang, Heng Kong, He Yang, Mingxu Chai, Zhilin Gao, Xingyu Liu, Yingnan Fu, Jiaming Liu, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Kang Wang, Yunke Zhang, Yuran Wang

Outline

MagicGUI is a foundational mobile GUI agent designed to address the core challenges of perception, grounding, and reasoning in real-world mobile GUI environments. It rests on six pillars: (1) a comprehensive and accurate dataset built with a scalable GUI data pipeline, combining open-source repositories, automated crawling, and targeted manual annotation into one of the largest and most diverse GUI-centric multimodal corpora; (2) enhanced perception and grounding capabilities that enable fine-grained multimodal alignment for UI element referring, grounding, and screen understanding; (3) a comprehensive, unified action space covering both basic UI operations and complex interaction intents; (4) a plan-driven reasoning mechanism that decomposes complex user instructions into sequential actions via explicit intermediate meta-planning; (5) an iterative two-stage training procedure that combines large-scale continual pre-training on 7.8 million samples with reinforcement fine-tuning using a spatially enhanced composite reward and dual filtering strategies; and (6) competitive results on the proprietary Magic-RICH benchmark and more than a dozen public benchmarks, demonstrating strong performance on GUI perception and agent tasks as well as robust generalization and real-world deployability, as detailed in Figure 1.
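To make points (3) and (4) concrete, here is a minimal, hypothetical sketch of what a unified action space and a meta-planning loop could look like. All names (`Action`, `meta_plan`, `act`) and the comma-based decomposition are illustrative assumptions, not the paper's actual interface; the real agent produces plans and actions with a multimodal model rather than string rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical unified action space: each action pairs a primitive type
# (tap, swipe, type_text, ...) with optional spatial or text arguments.
@dataclass
class Action:
    kind: str                            # e.g. "tap", "swipe", "type_text"
    target: Optional[Tuple[int, int]] = None  # screen coordinate, if spatial
    text: Optional[str] = None                # payload for typing actions

def meta_plan(instruction: str) -> List[str]:
    """Stand-in for the meta-planning step: decompose a compound user
    instruction into sequential sub-goals (here, a naive comma split)."""
    return [s.strip() for s in instruction.split(",") if s.strip()]

def act(sub_goal: str) -> Action:
    """Stand-in policy: map one sub-goal to a concrete action."""
    if sub_goal.startswith("type "):
        return Action(kind="type_text", text=sub_goal[len("type "):])
    return Action(kind="tap", target=(0, 0))  # dummy coordinate

# Plan first, then emit one concrete action per sub-goal.
plan = meta_plan("open settings, type airplane mode")
actions = [act(g) for g in plan]
print([a.kind for a in actions])  # -> ['tap', 'type_text']
```

The point of the structure is that planning (sub-goal decomposition) is an explicit intermediate step, separate from action emission, which is what lets the agent handle multi-step instructions rather than mapping the whole instruction to a single UI event.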

Takeaways, Limitations

Takeaways:
Presents a novel approach to perception, grounding, and reasoning in real-world mobile GUI environments.
Strong performance by leveraging a large-scale multimodal GUI dataset.
Ability to carry out complex tasks through a plan-driven reasoning mechanism.
Excellent generalization, suggesting deployability in real-world environments.
Limitations:
Limited detail on performance on the proprietary Magic-RICH benchmark.
Generalization may be limited across the full variety of mobile GUI environments.
Further validation is needed of robustness to unexpected situations in real-world use.
Little specific discussion of the scalability and maintainability of the data pipeline.
👍