Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

Created by
  • Haebom

Author

Yanda Li, Chi Zhang, Wenjia Jiang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei

Outline

This paper presents a novel LLM-based multimodal agent framework that operates on mobile devices. The framework explores mobile apps through human-like interactions and constructs a flexible action space that improves adaptability across diverse applications by incorporating parsing results together with text and vision descriptions. It operates in two phases: exploration and deployment. In the exploration phase, the functionality of user-interface elements is documented in a structured, customized knowledge base through agent-driven or manual exploration. In the deployment phase, retrieval-augmented generation (RAG) is used to efficiently retrieve and update this knowledge base, enabling the agent to perform complex, multi-step tasks accurately across a variety of applications. Experimental results on multiple benchmarks demonstrate the framework's strong performance in real-world scenarios. The code will be open-sourced soon.
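The sketch below is a rough, hypothetical illustration of the two-phase design described in the summary: exploration populates a structured knowledge base of UI-element descriptions, and deployment retrieves from it RAG-style to drive multi-step actions. All names here (UIDoc, KnowledgeBase, run_task, llm_decide_action, execute) are assumptions made for illustration only, not the framework's released API.

```python
# Hypothetical sketch of the exploration/deployment split; names are illustrative,
# not the AppAgent v2 codebase.
from dataclasses import dataclass


@dataclass
class UIDoc:
    """Structured record of one UI element, produced during exploration."""
    app: str
    element_id: str
    description: str  # what the element does, written by the agent or a human


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0


class KnowledgeBase:
    """Toy vector store standing in for the RAG-backed knowledge base."""

    def __init__(self, embed):
        self.embed = embed  # callable: str -> list[float]
        self.entries = []   # list of (embedding, UIDoc) pairs

    # Exploration phase: document each UI element's functionality.
    def add(self, doc: UIDoc) -> None:
        self.entries.append((self.embed(doc.description), doc))

    # Deployment phase: retrieve the most relevant documented elements for a task.
    def retrieve(self, query: str, k: int = 3):
        qv = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: -cosine(qv, e[0]))
        return [doc for _, doc in ranked[:k]]


def run_task(task: str, kb: KnowledgeBase, llm_decide_action, execute, max_steps: int = 10):
    """Deployment loop: retrieve UI docs, let the (multimodal) LLM pick an action, execute it."""
    for _ in range(max_steps):
        context = kb.retrieve(task)
        action = llm_decide_action(task, context)  # placeholder for the LLM call
        if action == "DONE":
            break
        execute(action)  # e.g., tap/type on the device via an automation backend
```

In this toy version the knowledge base is a flat in-memory list; the paper's point is simply that documenting elements once during exploration lets deployment retrieve only the relevant context per step instead of re-exploring the app.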

Takeaways, Limitations

Takeaways:
  • Demonstrates new possibilities for LLM-based multimodal agent frameworks on mobile devices.
  • Improves adaptability to diverse applications and enables high-precision task execution.
  • Handles complex, multi-step tasks.
  • Manages and updates the knowledge base efficiently via RAG.
  • Improves accessibility through the planned open-source release.
Limitations:
  • The available summary does not provide detailed performance metrics or a discussion of specific limitations.
  • Applicability and generalization across diverse real-world mobile environments and applications need further validation.
  • Once the code is open-sourced, its performance and stability should be evaluated in real usage environments.