This is a page that curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Generalist Bimanual Manipulation via Foundation Video Diffusion Models
Created by
Haebom
Author
Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Outline
VIDAR is a two-stage framework that combines large-scale video-based pre-training with a novel masked inverse dynamics model to address data scarcity and embodiment heterogeneity, improving the scalability of bimanual robot manipulation. A video diffusion model is pre-trained on 750K multi-view videos, and the masked inverse dynamics model extracts action-relevant information through masks without requiring pixel-level labels. With only 20 minutes of human demonstrations on a new robot platform (about 1% of the typical data requirement), VIDAR generalizes well to unseen tasks and backgrounds.
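For intuition, below is a minimal, hypothetical sketch of the two-stage inference pipeline described above: a stand-in video predictor generates future frames from the current observation, and a masked inverse dynamics model decodes an action from each masked pair of consecutive frames. All module names, tensor shapes, and the simplified predictor (no actual diffusion denoising loop) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a VIDAR-style two-stage pipeline (not the paper's code).
import torch
import torch.nn as nn


class VideoPredictor(nn.Module):
    """Stand-in for the pre-trained video diffusion model: given the current
    frame, predicts the next `horizon` future frames (denoising loop omitted)."""

    def __init__(self, horizon: int = 4, channels: int = 3):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Conv2d(channels, channels * horizon, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        b, c, h, w = frame.shape
        out = self.net(frame)                      # (B, C*horizon, H, W)
        return out.view(b, self.horizon, c, h, w)  # (B, horizon, C, H, W)


class MaskedInverseDynamics(nn.Module):
    """Stand-in for the masked inverse dynamics model: predicts a soft spatial
    mask over action-relevant pixels, then regresses the action from the
    masked pair of consecutive frames."""

    def __init__(self, channels: int = 3, action_dim: int = 14):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),      # soft mask in [0, 1]
        )
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, action_dim),    # e.g. 14-DoF bimanual action
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        pair = torch.cat([frame_t, frame_t1], dim=1)
        mask = self.mask_head(pair)                 # (B, 1, H, W)
        action = self.action_head(pair * mask)      # attend only to masked regions
        return action, mask


if __name__ == "__main__":
    obs = torch.rand(1, 3, 64, 64)                  # current camera frame
    predictor, idm = VideoPredictor(), MaskedInverseDynamics()
    future = predictor(obs)                         # (1, 4, 3, 64, 64) predicted frames
    frames = torch.cat([obs.unsqueeze(1), future], dim=1)
    # Decode one action per predicted transition (frame_t -> frame_{t+1}).
    actions = [idm(frames[:, t], frames[:, t + 1])[0] for t in range(frames.shape[1] - 1)]
    print(torch.stack(actions, dim=1).shape)        # (1, 4, 14)
```

The soft mask in this sketch illustrates the general idea of focusing the action decoder on action-relevant regions rather than background pixels, which is the stated motivation for generalizing across tasks and backgrounds from little data.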
Takeaways, Limitations
•
Takeaways:
◦
By combining large-scale video-based pre-training with a masked inverse dynamics model, we significantly improve the scalability and generalization performance of bimanual robot manipulation.
◦
It suggests that robot manipulation systems capable of adapting to diverse tasks and backgrounds can be built even with small amounts of data.
◦
We demonstrate the applicability of video-based foundation models to the field of robotic manipulation.
•
Limitations:
◦
Since the dataset is currently limited to three real-world bimanual robot platforms, generalization to a broader range of robot platforms and environments still needs to be validated.
◦
The training procedure and mask-generation mechanism of the masked inverse dynamics model are not described in detail.
◦
Evaluation on more challenging settings, such as long-horizon tasks or complex interactions, is still needed.