Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Created by
  • Haebom

Authors

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

Outline

This paper introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks such as pouring, wiping, and mixing by imitating AI-generated videos, without physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates candidate demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the videos, and these trajectories are retargeted to the robot regardless of embodiment. Extensive real-world evaluations show that the filtered generated videos are as effective as real demonstrations, and that performance improves as generation quality improves. The authors also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction with a VLM, and that robust 6D pose tracking outperforms other trajectory-extraction methods such as dense feature tracking. These results suggest that videos produced by state-of-the-art commercial generative models can serve as an effective source of supervision for robot manipulation.
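To make the four-stage pipeline concrete, here is a minimal Python sketch of how generation, filtering, tracking, and retargeting compose. Every name in it (generate_candidate_videos, vlm_follows_command, track_object_poses, retarget_to_robot, Pose6D) is a hypothetical placeholder standing in for the models the paper uses; this is not the authors' published API.

```python
# Minimal sketch of the RIGVid pipeline as summarized above.
# All functions are hypothetical stubs for the real models.

from dataclasses import dataclass
from typing import List


@dataclass
class Pose6D:
    """A 6D object pose: translation (x, y, z) plus a rotation quaternion."""
    position: tuple
    orientation: tuple  # (w, x, y, z)


def generate_candidate_videos(command: str, scene_image, n: int = 4) -> List[str]:
    """Stub for the video diffusion model: given a language command and an
    initial scene image, return paths to n candidate demonstration videos."""
    return [f"candidate_{i}.mp4" for i in range(n)]


def vlm_follows_command(video_path: str, command: str) -> bool:
    """Stub for the VLM filter: keep only videos that actually carry out
    the command. Here we trivially accept everything."""
    return True


def track_object_poses(video_path: str) -> List[Pose6D]:
    """Stub for the 6D pose tracker: extract the manipulated object's pose
    trajectory from the video. Returns a dummy three-step trajectory."""
    return [Pose6D((0.0, 0.0, 0.1 * t), (1.0, 0.0, 0.0, 0.0)) for t in range(3)]


def retarget_to_robot(trajectory: List[Pose6D]) -> List[Pose6D]:
    """Stub for embodiment-agnostic retargeting: map the object trajectory
    to end-effector targets for a specific robot."""
    return trajectory


def rigvid(command: str, scene_image) -> List[Pose6D]:
    """Compose the four stages: generate -> filter -> track -> retarget."""
    candidates = generate_candidate_videos(command, scene_image)
    accepted = [v for v in candidates if vlm_follows_command(v, command)]
    if not accepted:
        raise RuntimeError("VLM rejected all generated videos; regenerate.")
    trajectory = track_object_poses(accepted[0])
    return retarget_to_robot(trajectory)


if __name__ == "__main__":
    plan = rigvid("pour the water into the bowl", scene_image=None)
    print(f"{len(plan)} end-effector waypoints")
```

The key design point, per the summary above, is that only VLM-accepted videos proceed to pose tracking and execution.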

Takeaways, Limitations

Takeaways:
  • AI-generated videos open a new avenue for teaching robots manipulation skills.
  • No physical demonstrations or robot-specific training are required, reducing the cost and time of data collection.
  • Demonstrates that higher-quality generated videos translate directly into better robot manipulation performance.
  • Highlights the importance of robust 6D pose tracking for trajectory extraction (a retargeting sketch follows this list).
Limitations:
  • Performance depends on the quality of the AI-generated videos; shortcomings of the generative model propagate to robot manipulation performance.
  • Further research is needed on generalization across different environments and tasks.
  • Because 6D pose tracking accuracy strongly affects performance, fallback strategies for tracking failures may be required.
  • Currently limited to certain types of manipulation tasks; extending to a broader range of tasks remains future work.
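On the 6D pose tracking point above: one common way to turn a tracked object trajectory into robot end-effector targets is to assume a rigid grasp, so the gripper pose is the object pose composed with a fixed object-to-gripper transform measured at grasp time. The numpy sketch below illustrates that generic construction under this assumption; it is not the paper's implementation.

```python
# Retargeting a tracked 6D object trajectory to end-effector targets,
# assuming a rigid grasp: T_ee = T_obj @ T_grasp. Generic illustration,
# not the paper's code.

import numpy as np


def pose_to_matrix(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def retarget(object_traj, T_grasp: np.ndarray):
    """Map each world-frame object pose T_obj to an end-effector target.
    T_grasp is the fixed object-to-gripper transform captured at grasp time."""
    return [T_obj @ T_grasp for T_obj in object_traj]


if __name__ == "__main__":
    # Dummy trajectory: object translating 5 cm along +z over three frames.
    traj = [pose_to_matrix(np.eye(3), np.array([0.0, 0.0, 0.05 * k]))
            for k in range(3)]
    # Assume the gripper sits 10 cm above the object's origin when grasping.
    T_grasp = pose_to_matrix(np.eye(3), np.array([0.0, 0.0, 0.10]))
    for T_ee in retarget(traj, T_grasp):
        print(T_ee[:3, 3])  # end-effector target positions
```

If tracking drifts or fails mid-trajectory, every downstream waypoint inherits the error, which is why the Limitations above call for fallback strategies.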