Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Created by
  • Haebom

Authors

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu

Outline

Vidar aims to extend general-purpose manipulation to new robot platforms. The work presents a low-shot adaptation paradigm that replaces most platform-specific data with transferable video priors. Vidar consists of an embodied video diffusion model, which serves as the generalizable prior, and a masked inverse dynamics model (MIDM) adapter, built on a key decoupling of the policy. The video diffusion model is pretrained on internet-scale video and then domain-adapted to 750K multi-view trajectories from three real-world robot platforms, using a unified observation space that encodes robot, camera, task, and scene context. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target platform's action space while suppressing distractors. The generative video prior models the distribution of plausible, temporally consistent interactions, implicitly capturing affordances, contact dynamics, and physical consistency from large-scale unlabeled video. With only 20 minutes of human demonstrations on a previously unseen robot, Vidar outperforms existing VLA-based models and generalizes well to unseen tasks, backgrounds, and camera layouts.
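The two-stage structure described above (a video model proposes future frames, then a masked inverse dynamics model turns those frames into actions) can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the placeholder "video prior", the per-pixel sigmoid mask, and the linear action head are all assumptions made for illustration, and frames are flattened to short lists of pixel values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_video_prior(observation, horizon):
    """Hypothetical stand-in for the video diffusion model.

    Returns `horizon` future frames by adding a small drift to the
    observation; a real model would sample plausible future video.
    """
    return [[p + 0.01 * t for p in observation] for t in range(1, horizon + 1)]


def predict_mask(frame, mask_weights, mask_bias):
    """Soft per-pixel mask in (0, 1); assumed form, not the paper's network."""
    return [sigmoid(w * p + mask_bias) for p, w in zip(frame, mask_weights)]


def masked_inverse_dynamics(frame, mask_weights, mask_bias, action_weights):
    """Predict an action vector from one frame, using only masked pixels."""
    mask = predict_mask(frame, mask_weights, mask_bias)
    masked = [m * p for m, p in zip(mask, frame)]
    # Linear action head (illustrative): one weight row per action dimension.
    return [sum(w * x for w, x in zip(row, masked)) for row in action_weights]


def rollout(observation, horizon, mask_weights, mask_bias, action_weights):
    """Decoupled policy: imagine future frames first, then decode actions."""
    frames = toy_video_prior(observation, horizon)
    return [
        masked_inverse_dynamics(f, mask_weights, mask_bias, action_weights)
        for f in frames
    ]


if __name__ == "__main__":
    obs = [0.0, 1.0, 0.5, 0.2]          # a 4-pixel "image"
    head = [[1.0] * 4, [0.5] * 4]       # 2-dimensional action
    print(rollout(obs, 3, [0.0] * 4, 50.0, head))
```

In this sketch, only the mask and action head would need to be fitted per robot, while the frame predictor is shared; pixels whose mask value is near zero (for example, background distractors) contribute nothing to the predicted action.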

Takeaways, Limitations

The paper presents a scalable “one prior, many platforms” approach that combines strong, low-cost video priors with minimal on-robot alignment.
It reduces the need for large-scale data collection for new robots, enabling efficient adaptation.
It generalizes to unseen tasks, backgrounds, and camera layouts.
The paper does not state specific limitations.