
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Created by
  • Haebom

Author

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu

Outline

VIDAR is a two-stage framework that combines large-scale video-based pre-training with a novel masked dynamics model to address data scarcity and embodiment heterogeneity, improving the scalability of bimanual robot manipulation. A video diffusion model is pre-trained on 750K multi-view videos, and the masked dynamics model extracts action-relevant information through masks without pixel-level labels. On a new robotic platform, the approach generalizes well to unseen tasks and backgrounds from only 20 minutes of human demonstrations (about 1% of the typical data requirement).
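To make the two-stage structure concrete, below is a minimal PyTorch-style sketch of the inference flow described above: a pre-trained video prior predicts future frames from the current observation and a task embedding, and a masked dynamics head regresses actions from those frames. The module names, dimensions, and the simple MLP stand-in for the video diffusion model are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two-stage flow (assumed names and shapes, not the paper's code).
import torch
import torch.nn as nn

class VideoDiffusionPrior(nn.Module):
    """Stage 1 (stand-in): predicts a short clip of future frames
    conditioned on the current observation and a task embedding."""
    def __init__(self, frame_dim=64, task_dim=16, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.frame_dim = frame_dim
        self.net = nn.Sequential(
            nn.Linear(frame_dim + task_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * frame_dim),
        )

    def forward(self, obs, task):
        # obs: (B, frame_dim) flattened current observation
        # task: (B, task_dim) language/task embedding
        x = torch.cat([obs, task], dim=-1)
        return self.net(x).view(-1, self.horizon, self.frame_dim)  # predicted frames

class MaskedDynamicsHead(nn.Module):
    """Stage 2 (stand-in): soft-masks action-irrelevant features of the predicted
    frames and regresses robot actions from what remains, without pixel labels."""
    def __init__(self, frame_dim=64, action_dim=14):  # 14-DoF ~ two 7-DoF arms (assumption)
        super().__init__()
        self.mask_net = nn.Sequential(nn.Linear(frame_dim, frame_dim), nn.Sigmoid())
        self.action_net = nn.Linear(frame_dim, action_dim)

    def forward(self, frames):
        # frames: (B, T, frame_dim) predicted future frames
        mask = self.mask_net(frames)            # soft mask, learned without pixel-wise labels
        actions = self.action_net(frames * mask)
        return actions, mask

# Usage sketch: plan per-frame actions for one batch of observations.
obs = torch.randn(2, 64)    # toy current observation
task = torch.randn(2, 16)   # toy task embedding
prior = VideoDiffusionPrior()
head = MaskedDynamicsHead()
future = prior(obs, task)        # (2, 8, 64) predicted future frames
actions, mask = head(future)     # (2, 8, 14) actions, (2, 8, 64) mask
print(actions.shape, mask.shape)
```

The design point this sketch tries to reflect is that the action head only sees masked features, so the mask can suppress background and embodiment-specific information without pixel-wise supervision.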

Takeaways, Limitations

Takeaways:
Combining large-scale video-based pre-training with a masked dynamics model significantly improves the scalability and generalization performance of bimanual robot manipulation.
It suggests that a robot manipulation system can adapt to new tasks and backgrounds even with very small amounts of data.
It demonstrates that video-based foundation models are applicable to robotic manipulation.
Limitations:
The current dataset covers only three real-world bimanual robot platforms, so generalization to a wider range of platforms and environments still needs to be validated.
The training procedure and mask generation mechanism of the masked dynamics model are not described in detail.
Performance on more challenging settings, such as long-horizon tasks or complex interactions, remains to be evaluated.