Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
It is summarized using Google Gemini and operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

Created by
  • Haebom

Author

Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Xianglong Liu, Qi Dou, Yaodong Yang, Huijie Zhao, Weifeng Lv, Simin Li

Outline

This paper focuses on improving the robustness of Vision-Language-Action (VLA) models in real-world environments. Moving beyond prior work that considers only visual perturbations, the authors evaluate robustness across multiple modalities: action, instruction, environment, and observation. Across 17 types of perturbation, they find the action modality to be the most vulnerable. To address this, they propose RobustVLA, which enforces both output and input robustness: offline robust optimization against worst-case action noise, consistency of actions across input variations, and an online multi-armed bandit formulation to handle multiple simultaneous perturbations. Experiments on the LIBERO benchmark and on real robots demonstrate RobustVLA's superior performance.
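To make the two key ideas concrete, here is a minimal sketch of (a) a sampling-based surrogate for worst-case action-noise robust optimization and (b) a UCB1 multi-armed bandit that picks which perturbation type to train against next. This is an illustrative reconstruction, not the paper's actual implementation; the function and class names, the Gaussian noise model, and the UCB1 choice are all assumptions for exposition.

```python
import numpy as np

def worst_case_action_loss(policy_action, target_action, noise_scale,
                           n_samples=8, rng=None):
    """Approximate the inner max over action noise by sampling candidate
    perturbations and keeping the largest loss. A sampled surrogate for
    offline robust optimization (hypothetical, not the paper's exact method)."""
    rng = rng or np.random.default_rng(0)
    noises = rng.normal(0.0, noise_scale,
                        size=(n_samples,) + np.shape(target_action))
    # MSE of the policy output against each noise-perturbed target.
    losses = [np.mean((policy_action - (target_action + n)) ** 2)
              for n in noises]
    return max(losses)

class PerturbationBandit:
    """UCB1 bandit over perturbation types: arms with higher observed
    reward (e.g. robustness gain) are selected more often over time."""
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.t = 0

    def select(self):
        self.t += 1
        # Play each untried arm once before applying the UCB rule.
        for arm in range(len(self.counts)):
            if self.counts[arm] == 0:
                return arm
        ucb = self.values + np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, reward):
        # Incremental mean update of the arm's estimated reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In training, each step would query `select()` for a perturbation type, apply it, and feed the resulting robustness signal back via `update()`, so the most damaging perturbations receive the most training attention.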

Takeaways, Limitations

Takeaways:
Expands the scope of VLA robustness research to cover perturbations across multiple modalities.
Identifies the action modality as the most vulnerable and proposes effective countermeasures.
Achieves robustness through offline robust optimization, input-consistency regularization, and a multi-armed bandit formulation.
Verifies the effectiveness of the proposed method through experiments on the LIBERO benchmark and real robots.
Achieves faster inference than existing visually robust VLA models.
Limitations:
Relies on experimental results from a specific benchmark (LIBERO) and robot platform (FR5).
Although many perturbation types are considered, they cannot cover every perturbation that arises in real environments.
Further analysis of the method's complexity and computational cost may be needed.