On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Created by
Haebom
Author
Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Xianglong Liu, Qi Dou, Yaodong Yang, Huijie Zhao, Weifeng Lv, Simin Li
Outline
This paper focuses on improving the robustness of Vision-Language-Action (VLA) models in real-world environments. We address a limitation of existing research, which considers only visual perturbations, by evaluating robustness across four modalities: action, instruction, environment, and observation. Across 17 different perturbations, we find that the action modality is the most vulnerable. To address this, we propose RobustVLA, which enforces both output and input robustness: offline robust optimization against worst-case action noise, consistent actions across input variations that preserve task semantics, and a multi-armed bandit formulation to identify the most harmful perturbation when several are present. We demonstrate the superior performance of RobustVLA through experiments on the LIBERO benchmark and a real-world robot.
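As a rough illustration of the multi-armed bandit idea mentioned above, the sketch below treats each perturbation type as an arm and uses a standard UCB1 rule to select the perturbation currently estimated to be most harmful (highest observed task loss) during training. All names here (PERTURBATIONS, the loss placeholder, the toy loop) are hypothetical and not taken from the paper's implementation.

```python
# Minimal, hypothetical sketch of a UCB1 bandit over perturbation types.
# Not the paper's code: identifiers and the loss signal are illustrative only.

import math
import random

PERTURBATIONS = ["action_noise", "instruction_paraphrase", "camera_shift", "background_change"]

counts = {p: 0 for p in PERTURBATIONS}
mean_loss = {p: 0.0 for p in PERTURBATIONS}

def select_perturbation(step: int) -> str:
    """UCB1: favor perturbations with high observed loss or little exploration."""
    for p in PERTURBATIONS:          # try every arm at least once
        if counts[p] == 0:
            return p
    return max(
        PERTURBATIONS,
        key=lambda p: mean_loss[p] + math.sqrt(2.0 * math.log(step + 1) / counts[p]),
    )

def update(p: str, loss: float) -> None:
    """Running-mean update of the observed loss for the chosen perturbation."""
    counts[p] += 1
    mean_loss[p] += (loss - mean_loss[p]) / counts[p]

# Toy loop: pick the currently most harmful perturbation, observe the induced loss.
for step in range(100):
    p = select_perturbation(step)
    loss = random.random()           # placeholder for the policy's task loss under p
    update(p, loss)
```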
Takeaways, Limitations
•
Takeaways:
◦
Expands the scope of VLA robustness research to perturbations across multiple modalities.
◦
Identifies the action modality as the most vulnerable and proposes effective ways to address it.
◦
Robustness is achieved through offline robust optimization, input-consistency regularization, and a multi-armed bandit formulation over perturbations.
◦
We verify the effectiveness of the proposed method through experiments on the LIBERO benchmark and a real robot.
◦
Achieves faster inference speed than existing visually robust VLA models.
•
Limitations:
◦
Evaluation relies on a specific benchmark (LIBERO) and robot platform (FR5).
◦
Although a variety of perturbations are considered, they do not fully cover every perturbation that can occur in real-world environments.
◦
Further analysis of the complexity and computational cost of the proposed method may be required.