Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Created by
  • Haebom

Authors

Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan

Outline

This paper introduces BridgeVLA, a novel 3D Vision-Language-Action (VLA) model that leverages a pre-trained Vision-Language Model (VLM) for efficient robot manipulation learning. BridgeVLA projects 3D inputs onto multiple 2D images and predicts actions as 2D heatmaps, aligning with the VLM backbone and unifying the input and output spaces within a consistent 2D image space. The authors further propose a scalable pre-training method that equips the VLM backbone to predict 2D heatmaps before downstream policy learning. Experimental results show that BridgeVLA outperforms state-of-the-art baselines across three simulation benchmarks, achieving average success rates of 88.2% on RLBench and 64.0% on COLOSSEUM, and surpassing all baselines on GemBench. In real-robot experiments, BridgeVLA outperforms the state-of-the-art baseline by 32% and generalizes strongly across multiple out-of-distribution settings, including visual distractions and novel instructions.
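
To make the input-output alignment concrete, here is a minimal numpy sketch of the general idea: a point cloud is projected onto axis-aligned 2D views, the peak of each view's heatmap stands in for a predicted 2D action, and the peaks are deprojected back to a 3D translation. The function names, the orthographic projection, and the workspace bounds are illustrative assumptions, not the paper's actual implementation (which predicts the heatmaps with the VLM backbone).

```python
import numpy as np

def orthographic_project(points, plane, resolution=224, bounds=(-0.5, 0.5)):
    """Project a 3D point cloud onto one axis-aligned 2D image plane."""
    axes = {"xy": [0, 1], "xz": [0, 2], "yz": [1, 2]}[plane]
    lo, hi = bounds
    uv = (points[:, axes] - lo) / (hi - lo)             # normalize to [0, 1]
    return np.clip((uv * (resolution - 1)).astype(int), 0, resolution - 1)

def heatmap_from_pixels(px, resolution=224):
    """Rasterize projected points into a 2D occupancy heatmap."""
    hm = np.zeros((resolution, resolution))
    np.add.at(hm, (px[:, 1], px[:, 0]), 1.0)            # rows = v, cols = u
    return hm

def pixel_to_world(u, v, resolution=224, bounds=(-0.5, 0.5)):
    """Map a heatmap peak back to the two world coordinates it encodes."""
    lo, hi = bounds
    return lo + np.array([u, v]) / (resolution - 1) * (hi - lo)

# Toy usage: recover a 3D target position from per-view heatmap peaks.
rng = np.random.default_rng(0)
cloud = rng.uniform(-0.4, 0.4, size=(2048, 3))          # stand-in for an RGB-D scene
target = np.array([0.1, -0.2, 0.3])                     # hypothetical action location
cloud = np.vstack([cloud, np.tile(target, (500, 1))])   # densify around the target

coords = {}
for plane in ("xy", "xz", "yz"):
    hm = heatmap_from_pixels(orthographic_project(cloud, plane))
    v, u = np.unravel_index(np.argmax(hm), hm.shape)    # peak = predicted 2D action
    coords[plane] = pixel_to_world(u, v)

# Each world axis appears in two views; average the two estimates.
x = (coords["xy"][0] + coords["xz"][0]) / 2
y = (coords["xy"][1] + coords["yz"][0]) / 2
z = (coords["xz"][1] + coords["yz"][1]) / 2
print(np.round([x, y, z], 3))                           # ≈ [0.1, -0.2, 0.3]
```

Because every world axis is observable in two of the three views, the 2D peaks jointly pin down a 3D point, which is what lets both the inputs and the action outputs live in the same 2D image space.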

Takeaways, Limitations

Takeaways:
  • Integrates 3D data into a VLM to improve the efficiency and performance of robot manipulation learning.
  • Ensures consistent data processing by unifying inputs and outputs within a 2D image space.
  • Improves the VLM backbone's 2D heatmap prediction through a scalable pre-training method (see the sketch after this list).
  • Demonstrates strong performance in both simulation and real-robot experiments.
  • Shows excellent sample efficiency, achieving high success rates from a small number of trajectories.
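
The heatmap pre-training point can be illustrated with a standard recipe: supervise the model's per-pixel logits against a Gaussian rendered at the labeled 2D location, using cross-entropy over pixels. This is a common heatmap-supervision setup, not necessarily the paper's exact objective; gaussian_heatmap, the sigma value, and the loss form are illustrative assumptions.

```python
import numpy as np

def gaussian_heatmap(u, v, resolution=224, sigma=4.0):
    """Unnormalized 2D Gaussian bump centered on pixel (u, v)."""
    xs = np.arange(resolution)
    gx = np.exp(-((xs - u) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - v) ** 2) / (2 * sigma ** 2))
    return np.outer(gy, gx)                    # rows = image y, cols = image x

def heatmap_cross_entropy(logits, target):
    """Cross-entropy between predicted heatmap logits and a target distribution."""
    x = logits.reshape(-1)
    log_probs = x - (x.max() + np.log(np.exp(x - x.max()).sum()))  # log-softmax
    return -(target.reshape(-1) * log_probs).sum()

# Ground truth: a normalized Gaussian placed on the labeled pixel.
target = gaussian_heatmap(100, 60)
target /= target.sum()

good = 10.0 * gaussian_heatmap(100, 60)   # logits peaked at the labeled pixel
bad = 10.0 * gaussian_heatmap(30, 200)    # logits peaked somewhere else
print(heatmap_cross_entropy(good, target) < heatmap_cross_entropy(bad, target))  # True
```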
Limitations:
  • The paper does not explicitly state its limitations. Likely candidates include information loss when projecting 3D data to 2D, model complexity, and computational cost.