Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Created by
  • Haebom

Authors

Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, Chen Feng

Outline

This paper proposes SeeDo, a method for interpreting human demonstration videos and generating robot task plans with a Vision Language Model (VLM). SeeDo is a pipeline that integrates keyframe selection, visual recognition, and VLM inference; the name reflects its workflow, in which the system watches a human demonstration video (See) and produces a task plan the robot can execute (Do). The authors construct a dataset of diverse pick-and-place demonstration videos and experimentally compare SeeDo against several state-of-the-art video-input VLM baselines, reporting superior performance. The generated task plans are deployed both in simulation environments and on a real robot arm.
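To make the pipeline structure concrete, below is a minimal sketch of what a SeeDo-style flow could look like. This is not the authors' code: the helper names (select_keyframes, query_vlm, video_to_plan), the frame-sampling heuristic, and the prompt format are all hypothetical stand-ins for the keyframe selection, visual recognition, and VLM inference stages described in the paper.

```python
# Hypothetical sketch of a SeeDo-style pipeline: keyframes -> visual descriptions -> VLM plan.
from dataclasses import dataclass


@dataclass
class Keyframe:
    index: int
    description: str  # visual-recognition output, e.g. detected objects and their layout


def select_keyframes(frame_descriptions: list[str], stride: int = 10) -> list[Keyframe]:
    """Placeholder keyframe selection: keep every `stride`-th frame."""
    return [Keyframe(i, d) for i, d in enumerate(frame_descriptions) if i % stride == 0]


def query_vlm(prompt: str) -> str:
    """Stand-in for a real VLM call; returns a canned pick-and-place plan here."""
    return "1. pick(red block)\n2. place(red block, on blue block)"


def video_to_plan(frame_descriptions: list[str]) -> str:
    """Turn per-frame visual descriptions into an ordered action plan via the VLM."""
    keyframes = select_keyframes(frame_descriptions)
    context = "\n".join(f"Frame {k.index}: {k.description}" for k in keyframes)
    prompt = (
        "You watched a human pick-and-place demonstration.\n"
        f"{context}\n"
        "List the robot actions needed to reproduce it, one per line."
    )
    return query_vlm(prompt)


if __name__ == "__main__":
    demo = [f"red block near position {i}" for i in range(30)]
    print(video_to_plan(demo))
```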

Takeaways, Limitations

Takeaways:
  • Presents a novel approach to generating robot task plans from human demonstration videos using a VLM.
  • Builds an effective pipeline integrating keyframe selection, visual recognition, and VLM inference.
  • Demonstrates strong performance across a variety of tasks and environments.
  • Successfully deploys the generated plans in simulation and on a real robot arm.
Limitations:
  • Experiments are limited to pick-and-place tasks, so further research on generalization is needed.
  • Scalability to more complex and diverse tasks remains to be verified.
  • The robustness of the VLM to interpretation errors needs improvement.
  • Noise and uncertainty in real-world environments require further consideration.