Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Created by
  • Haebom

Authors

Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

Outline

This paper proposes UITron-Speech, a voice-based GUI agent. To overcome the accessibility and usability limitations of existing text-based GUI agents, the authors develop what they describe as the first end-to-end GUI agent that directly processes voice commands and on-device screenshots to predict user actions. To address the shortage of training data, they synthesize a high-quality voice command dataset using a random-speaker text-to-speech model, and they design a mixed-modality training strategy to mitigate the modality imbalance of pre-trained base models. They also perform a statistical analysis of the GUI grounding prediction error distribution and propose a training-free, two-step grounding improvement method to mitigate minor positional errors. Extensive experiments on various benchmarks show that UITron-Speech achieves robust performance and adapts well across settings, highlighting the feasibility and potential of voice-based GUI agents. The code and dataset are available at https://github.com/UITron-hub/UITron-Speech.
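The summary does not spell out what the two steps of the grounding improvement method are. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is a coarse-to-fine scheme: predict a click point on the full screenshot, crop a window around that point, predict again on the crop, and map the result back to full-screen coordinates. Every name below (crop_box_around, refine_two_step, make_toy_predictor, the 400-pixel crop size, and the toy error model) is hypothetical and does not come from the paper or its repository.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels

# Hypothetical predictor interface: given a region of the screenshot,
# return a click point in coordinates local to that region.
Predictor = Callable[[Box], Point]


def crop_box_around(point: Point, screen_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Return a crop window centred on `point`, clamped to the screen."""
    x, y = point
    w, h = screen_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    left = int(min(max(x - cw / 2, 0), w - cw))
    top = int(min(max(y - ch / 2, 0), h - ch))
    return (left, top, left + cw, top + ch)


def refine_two_step(predict: Predictor, screen_size: Tuple[int, int],
                    crop_size: Tuple[int, int] = (400, 400)) -> Point:
    """Coarse prediction on the full screen, then a second pass on a crop."""
    full: Box = (0, 0, screen_size[0], screen_size[1])
    coarse = predict(full)  # step 1: full-screen prediction
    box = crop_box_around(coarse, screen_size, crop_size)
    local = predict(box)    # step 2: prediction on the cropped region
    # Map crop-local coordinates back to full-screen coordinates.
    return (box[0] + local[0], box[1] + local[1])


def make_toy_predictor(target: Point, rel_error: float) -> Predictor:
    """Toy stand-in for a model: its error scales with the region size.

    This error model is an assumption made purely so the sketch can run.
    """
    def predict(region: Box) -> Point:
        left, top, right, bottom = region
        dx = rel_error * (right - left)
        dy = rel_error * (bottom - top)
        return (target[0] - left + dx, target[1] - top + dy)
    return predict


if __name__ == "__main__":
    target = (805.0, 612.0)
    toy = make_toy_predictor(target, rel_error=0.02)
    coarse = toy((0, 0, 1920, 1080))
    refined = refine_two_step(toy, (1920, 1080))
    print("coarse:", coarse)
    print("refined:", refined)
```

Under the toy error model, the second pass reduces the offset because the predictor sees a smaller region. Whether a real model behaves this way depends on its actual error distribution, which is what the paper's statistical analysis examines.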

Takeaways, Limitations

Takeaways:
Demonstrates the feasibility of voice-based GUI agents and their potential to improve accessibility.
Presents effective data synthesis and training strategies to address the shortage of training data.
Proposes an efficient, training-free method to reduce GUI grounding errors.
Opens new possibilities for more convenient and intelligent human-computer interaction.
Limitations:
Further research is needed on the generalization performance of the method presented in this paper.
Robustness assessment across diverse speech and language environments is needed.
Performance evaluation and user experience research in actual usage environments are needed.
Applicability verification for complex GUIs or various types of GUIs is required.