Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Created by
  • Haebom

Author

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, Ziwei Wang

Outline

In this paper, we present VLA-RL, an algorithmic and systems framework that uses online reinforcement learning (RL) to improve pre-trained autoregressive vision-language-action (VLA) models on downstream tasks. Existing VLA models are typically fine-tuned on offline data that covers only a limited set of states, so they fail out of distribution; VLA-RL addresses this with exploration-based online RL that improves on data collected at test time. We introduce a trajectory-level RL formulation for autoregressive VLA training and, to address the sparse-reward problem, fine-tune a pre-trained vision-language model into a robot process reward model using pseudo-reward labels annotated on automatically extracted task segments. We also describe implementation techniques that improve stability and efficiency: a curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. With VLA-RL, OpenVLA-7B surpasses the strongest fine-tuned baseline by 4.5% on 40 challenging robot manipulation tasks from LIBERO and reaches performance comparable to advanced commercial models such as $\pi_0$-FAST. The observed benefit of test-time optimization is an early sign of inference scaling laws in robotics.
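To make the trajectory-level RL formulation concrete, here is a minimal toy sketch of the idea: sample a trajectory of autoregressive action tokens, score each step with a process reward model that supplies dense pseudo-rewards in place of a sparse terminal reward, and apply a REINFORCE-style update on the summed return. All names here (`ProcessRewardModel`, `reinforce_update`, the tabular softmax policy) are illustrative assumptions, not the paper's actual architecture or codebase.

```python
import math
import random

random.seed(0)

class ProcessRewardModel:
    """Stand-in for a fine-tuned vision-language model that scores task
    sub-segments, yielding dense pseudo-rewards instead of one sparse
    terminal reward. Here it simply rewards the (hypothetical) goal action."""
    def __init__(self, goal_token):
        self.goal_token = goal_token

    def score(self, action_token):
        return 1.0 if action_token == self.goal_token else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def reinforce_update(logits, prm, horizon=5, lr=0.5):
    """One trajectory-level policy-gradient step: roll out `horizon`
    autoregressive action tokens, sum PRM pseudo-rewards into a return,
    and push log-probs of the taken actions in proportion to the
    baseline-subtracted return."""
    traj, ret = [], 0.0
    for _ in range(horizon):
        a = sample(softmax(logits))
        traj.append(a)
        ret += prm.score(a)
    baseline = horizon / len(logits)  # crude baseline: return of a uniform policy
    adv = ret - baseline
    for a in traj:
        probs = softmax(logits)
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]  # d log pi(a) / d logit_i
            logits[i] += lr * adv * grad / horizon
    return ret

# Tiny policy over 3 discrete action tokens; token 2 is the goal.
logits = [0.0, 0.0, 0.0]
prm = ProcessRewardModel(goal_token=2)
for _ in range(300):
    reinforce_update(logits, prm)
probs = softmax(logits)
```

In the real system the policy is the autoregressive VLA itself, the rollout runs in vectorized simulation environments, and the update is a full RL algorithm rather than this tabular REINFORCE step; the sketch only shows how dense pseudo-rewards turn a sparse-reward trajectory into a usable learning signal.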

Takeaways, Limitations

Takeaways:
Presents an effective framework for improving pre-trained VLA models via online reinforcement learning.
Overcomes the limitations of limited offline data and improves robot manipulation in out-of-distribution situations.
Demonstrates the value of test-time optimization and offers early evidence of inference scaling laws in robotics.
Achieves performance comparable to advanced commercial models.
Limitations:
Further research is needed on the generalization of the proposed method.
Scalability needs to be verified across a wider range of robot platforms and tasks.
The accuracy and reliability of the pseudo-reward labels require further analysis.
The computational cost and time requirements of online learning need to be considered.