Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Aria-UI: Visual Grounding for GUI Instructions

Created by
  • Haebom

Authors

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

Overview

This paper presents Aria-UI, a large multimodal model designed for GUI grounding, aimed at digital agents that automate tasks by directly manipulating GUIs across platforms. Aria-UI tackles the problem of linking language instructions to target elements with a pure-vision approach that does not rely on HTML or AXTree inputs. To adapt to heterogeneous planning instructions, the authors propose a scalable data pipeline that synthesizes diverse, high-quality instruction samples. To handle dynamic context during task execution, the model incorporates textual and text-image interleaved action histories, which supports context-aware inference. Experiments show that Aria-UI achieves state-of-the-art results on both offline and online agent benchmarks, outperforming vision-only and AXTree-based baselines. All training data and model checkpoints are publicly released.
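As a rough illustration of the pure-vision setup described above, the sketch below assembles a grounding request from a screenshot, an instruction, and an interleaved action history, then maps a model-predicted point to pixel coordinates. Every name here (`HistoryStep`, `build_prompt`, `to_pixels`), the `<image>` placeholder token, and the 0 to 1000 normalized coordinate range are assumptions made for illustration; none of them is taken from the paper or from Aria-UI's released code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HistoryStep:
    """One past action (hypothetical type); screenshot_path is None for a text-only step."""
    action: str
    screenshot_path: Optional[str] = None


def build_prompt(instruction: str, screenshot_path: str,
                 history: List[HistoryStep]) -> Tuple[str, List[str]]:
    """Return (prompt text, ordered image paths) for a hypothetical grounding model.

    No HTML or AXTree is passed in: the only inputs are text and screenshots.
    """
    lines: List[str] = []
    images: List[str] = []
    for i, step in enumerate(history, start=1):
        if step.screenshot_path is not None:
            # Text-image step: the screenshot is interleaved with the action text.
            images.append(step.screenshot_path)
            lines.append(f"Step {i}: <image> {step.action}")
        else:
            # Text-only step.
            lines.append(f"Step {i}: {step.action}")
    images.append(screenshot_path)
    lines.append("Current screen: <image>")
    lines.append(f"Instruction: {instruction}")
    return "\n".join(lines), images


def to_pixels(point: Tuple[float, float], width: int, height: int,
              scale: int = 1000) -> Tuple[int, int]:
    """Map a point in an assumed [0, scale] normalized range to pixel coordinates."""
    x, y = point
    return round(x / scale * width), round(y / scale * height)


history = [HistoryStep("open settings"), HistoryStep("tap Wi-Fi", "s1.png")]
prompt, images = build_prompt("turn on Wi-Fi", "s2.png", history)
click_xy = to_pixels((500, 250), width=1920, height=1080)
```

In this sketch the model call itself is left out on purpose; the point is only that the agent's input consists of screenshots and text, and that the output is a screen location an agent could click.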

Takeaways, Limitations

Takeaways:
The paper introduces Aria-UI, a multimodal model that improves the performance of agents automating GUI-based tasks.
Removing the dependency on HTML or AXTree inputs allows for more robust and general agent development.
A scalable data pipeline improves adaptability to diverse planning instructions.
Context-aware inference over textual and text-image interleaved task histories links instructions to target elements more accurately.
Publicly releasing the training data and model checkpoints supports continued research.
Limitations:
Aria-UI is evaluated on several benchmarks, but its generalization to a wider range of real-world GUI environments may need additional verification.
There may be bias towards certain types of GUI or tasks.
The data pipeline may face scalability limits and requires ongoing data-quality management.
Further research may be needed on the ability to process complex and ambiguous task instructions.