Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Real-World Offline Reinforcement Learning from Vision Language Model Feedback

Created by
  • Haebom

Authors

Sreyas Venkataraman, Yufei Wang, Ziyu Wang, Navin Sriram Ravie, Zackory Erickson, David Held

Outline

This paper addresses offline reinforcement learning (offline RL), which enables policy learning from pre-collected, suboptimal datasets without online interaction. The setting is particularly suited to real-world robots and safety-critical scenarios, where online data collection or expert demonstration is slow, expensive, or dangerous. Most existing offline RL work assumes the dataset is already labeled with task rewards, but producing such labels requires significant effort, especially in real-world settings where ground truth is hard to determine.

The paper proposes a system, built on RL-VLM-F, that automatically generates reward labels for offline datasets using preference feedback from a vision-language model together with a textual description of the task, and then trains policies with offline RL on the relabeled dataset. The authors demonstrate the approach on the complex task of robot-assisted dressing with a real robot: a reward function is first learned from a suboptimal offline dataset using the vision-language model, and the learned reward is then used to train an effective dressing policy with Implicit Q-Learning (IQL). The method also performs well on simulation tasks involving manipulation of rigid and deformable objects, significantly outperforming baselines such as behavior cloning and inverse reinforcement learning (IRL). In summary, the paper presents a system that enables automatic reward labeling and policy learning from unlabeled, suboptimal offline datasets.
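To make the pipeline concrete, below is a minimal sketch (not the authors' implementation) of the reward-labeling stage: pairs of frames from the offline dataset are compared by a vision-language model given the task description, a reward model is fit with the standard Bradley-Terry preference loss, and the dataset is then relabeled with the learned reward. The helper `query_vlm_preference` and the dataset interface (`sample_observation_pairs`, per-transition dicts) are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps an observation (flattened image features here) to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                    label: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: label = 1 if frame A is preferred, 0 if frame B."""
    logits = r_a - r_b  # higher predicted reward => more likely preferred
    return F.binary_cross_entropy_with_logits(logits, label)


def label_dataset_with_vlm(reward_model, dataset, query_vlm_preference,
                           task_description: str, epochs: int = 10):
    """Fit the reward model from VLM preference feedback, then relabel
    every transition in the offline dataset with the learned reward."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=3e-4)
    for _ in range(epochs):
        for obs_a, obs_b in dataset.sample_observation_pairs():
            # The VLM sees both frames plus the textual task description and
            # returns which frame shows more task progress (hypothetical API).
            label = query_vlm_preference(obs_a, obs_b, task_description)
            loss = preference_loss(reward_model(obs_a),
                                   reward_model(obs_b),
                                   torch.as_tensor(label, dtype=torch.float32))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Relabel the offline data with the learned reward.
    with torch.no_grad():
        for transition in dataset:
            transition["reward"] = reward_model(transition["obs"]).item()
    return dataset
```

The relabeled dataset can then be handed to a standard Implicit Q-Learning implementation to train the final policy, which corresponds to the offline RL step described above.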

Takeaways, Limitations

Takeaways:
  • A novel method for automatically generating reward labels for offline datasets with vision-language models, increasing the real-world applicability of offline RL.
  • Outperforms existing baselines on both a real-robot dressing task and simulation tasks.
  • Demonstrates the effectiveness of offline RL for complex manipulation tasks.
Limitations:
  • The approach depends on the vision-language model; degraded VLM performance can degrade the entire system.
  • Further validation of the generalization ability of the vision-language model used is needed.
  • Further research is needed to improve the generalizability of reward function learning beyond specific tasks.
  • Performance may vary with the diversity and complexity of real-world datasets.