This page curates AI-related papers published worldwide. All content is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Created by
Haebom
Author
Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
Outline
This paper proposes UITron-Speech, a speech-driven GUI agent. To overcome the accessibility and usability limitations of existing text-based GUI agents, the authors develop the first end-to-end GUI agent that directly processes spoken commands together with device screenshots to predict user actions. To address the scarcity of speech-instruction data, they synthesize a high-quality spoken-command dataset using a random-speaker text-to-speech model and design a mixed-modality training strategy to mitigate the modality imbalance of the pre-trained base model. Furthermore, they perform a statistical analysis of the GUI grounding prediction error distribution and propose a training-free, two-step grounding refinement method to correct minor positional errors. Extensive experiments on diverse benchmarks show that UITron-Speech achieves robust performance and strong adaptability, highlighting the feasibility and potential of speech-based GUI agents. The code and dataset are available at https://github.com/UITron-hub/UITron-Speech.
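The summary does not spell out the paper's exact two-step grounding procedure, but the general idea behind such training-free refinements can be sketched as follows: make a coarse click-point prediction on the full screenshot, then re-predict inside a zoomed-in crop around that point and map the result back to screen coordinates. The `predict_point` callable below is a hypothetical stand-in for any GUI-grounding model, not the authors' API.

```python
# Hypothetical sketch of a training-free two-step grounding refinement.
# `predict_point(region)` is an assumed interface: given a region
# (left, top, width, height), it returns an (x, y) point relative to
# that region's top-left corner.

def two_step_grounding(predict_point, screenshot_size, crop_ratio=0.25):
    """Return a refined (x, y) click point in full-screen coordinates."""
    W, H = screenshot_size

    # Step 1: coarse prediction on the full screenshot.
    cx, cy = predict_point((0, 0, W, H))

    # Step 2: re-predict inside a crop centred on the coarse point,
    # clamped so the crop stays within the screenshot bounds.
    cw, ch = int(W * crop_ratio), int(H * crop_ratio)
    left = min(max(cx - cw // 2, 0), W - cw)
    top = min(max(cy - ch // 2, 0), H - ch)
    rx, ry = predict_point((left, top, cw, ch))

    # Map the refined point back to full-screen coordinates.
    return left + rx, top + ry
```

Because the second pass sees the target at a larger effective resolution, small localization errors from the first pass can be corrected without any additional training.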