Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

GUI-PRA: Process Reward Agent for GUI Tasks

Created by
  • Haebom

Authors

Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, Shengyu Zhang

Outline

GUI agents powered by multimodal large language models (MLLMs) show great potential for task automation, but they often struggle with long-horizon tasks. Process reward models (PRMs) are a promising solution because they can guide these agents with important process signals during inference, but applying them to the GUI domain presents unique challenges. When handling dense artificial inputs with long history data, PRMs suffer from the "lost in the middle" phenomenon, in which excessive past context impairs the evaluation of the current step. Furthermore, standard PRMs lack awareness of GUI changes and provide static evaluations that are mismatched with the inherently dynamic nature of GUI tasks. To address these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to intelligently process historical context and actively perceive UI state changes, thereby providing better process rewards than standard PRMs. Specifically, to directly address the "lost in the middle" phenomenon, we introduce a dynamic memory mechanism with two components: a relevance-based retrieval module that actively retrieves pertinent information from a long history, and a progressive summarization module that dynamically condenses the growing interaction data, so that the model focuses on relevant context. Furthermore, to address the lack of UI change awareness, we introduce an adaptive UI perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tools to gather supporting visual evidence, ensuring that evaluations are always informed by the current UI context.
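To make the dynamic memory idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bag-of-words cosine similarity in place of whatever relevance scorer the paper uses, and a plain concatenating function in place of a model-based summarizer. The names DynamicMemory, window, relevance, and concat_summarizer are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable


def _bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, step: str) -> float:
    """Cosine similarity between bag-of-words vectors (assumed scoring rule)."""
    q, s = _bag_of_words(query), _bag_of_words(step)
    dot = sum(q[w] * s[w] for w in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in s.values()))
    return dot / norm if norm else 0.0


def concat_summarizer(summary: str, step: str) -> str:
    """Placeholder for a model-based summarizer: it only appends the step."""
    return step if not summary else f"{summary}; {step}"


@dataclass
class DynamicMemory:
    """Keeps the last `window` steps verbatim and folds older ones into a summary."""
    window: int = 2  # assumed budget, not a value from the paper
    summarizer: Callable[[str, str], str] = concat_summarizer
    steps: list[str] = field(default_factory=list)
    summary: str = ""
    folded: int = 0  # number of steps already merged into `summary`

    def add(self, step: str) -> None:
        """Record a step, then progressively summarize anything outside the window."""
        self.steps.append(step)
        while len(self.steps) - self.folded > self.window:
            self.summary = self.summarizer(self.summary, self.steps[self.folded])
            self.folded += 1

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k most relevant steps from the full history, in time order."""
        ranked = sorted(
            range(len(self.steps)),
            key=lambda i: relevance(query, self.steps[i]),
            reverse=True,
        )
        return [self.steps[i] for i in sorted(ranked[:k])]

    def context(self, query: str, k: int = 2) -> dict:
        """Assemble the compact context a judge would see for the current step."""
        return {
            "summary": self.summary,
            "retrieved": self.retrieve(query, k),
            "recent": self.steps[self.folded:],
        }
```

In this sketch the judge never receives the full history. It sees a running summary, a few retrieved steps, and the most recent steps, which is one plausible way to keep the current step from being buried under past context.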

Takeaways, Limitations

Takeaways:
GUI-PRA presents a novel approach to improving the performance of MLLM-based agents on GUI tasks.
The dynamic memory mechanism addresses the "lost in the middle" phenomenon and helps the agent focus on relevant context.
Adaptive UI awareness mechanisms improve the agent's assessment by taking into account UI state changes.
Limitations:
The paper lacks specific experimental results or performance comparison information.
Implementation details and descriptions of the specific algorithms are also lacking.
Further research is needed to determine the generalizability of the proposed mechanism and its applicability to various GUI tasks.