Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Created by
  • Haebom

Author

Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua

Outline

Graphical user interface (GUI) agents based on large-scale visual-language models (LVLMs) have emerged as an innovative approach for autonomously operating personal devices or applications to perform complex, real-world tasks. However, their tight integration with personal devices poses numerous threats, including backdoor attacks, which remain largely unexplored. This study reveals that the visual foundation that maps text plans to GUI elements in GUI agents introduces vulnerabilities, enabling a new type of backdoor attack. Backdoor attacks targeting the visual foundation can corrupt the agent's behavior even when given an accurate task-solving plan. To verify this vulnerability, this study proposes a method called VisualTrap, which exploits the foundation by tricking the agent into finding text plans at trigger locations other than the intended target. VisualTrap utilizes a common method of injecting poisoned data into the attack, ensuring the feasibility of the attack by performing this task during visual-based pretraining. Experimental results demonstrate that VisualTrap can effectively exploit visual-based attacks using only 5% of the poisoned data and highly stealthy visual triggers (invisible to the human eye). This attack can be generalized to downstream tasks even after careful fine-tuning. Furthermore, the injected triggers can be effective across a variety of GUI environments, including being trained on mobile/web and generalized to desktop environments. These results highlight the need for further research into the risk of backdoor attacks on GUI agents.

Takeaways, Limitations

Takeaways: By revealing the possibility of backdoor attacks on the visual foundation of GUI agents and presenting a practical attack method, VisualTrap, we raised awareness of the importance of GUI agent security and its vulnerabilities. By demonstrating that attacks are possible with as little as 5% of the poisoned data and invisible triggers, we emphasize the severity of the actual threat. Furthermore, we demonstrate the generalizability of the attack, suggesting its potential for use in a variety of environments.
Limitations: VisualTrap currently focuses solely on visual attacks and does not consider other attack vectors (e.g., vulnerabilities in the language model itself). Generalizing test results to specific GUI environments may be limited. Further research is needed on a wider range of GUI agents and environments. Furthermore, research on developing defense mechanisms against VisualTrap is still insufficient.
👍