EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Created by
Haebom
Author
Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Outline
In this paper, we propose EgoVLA, a Vision-Language-Action (VLA) model trained on egocentric human videos to overcome the difficulty of collecting real-robot data for imitation learning in robot manipulation. We train the VLA model on the rich scene and task information in human video data and convert human actions into robot actions through inverse kinematics and retargeting. We then fine-tune the model with a small number of robot manipulation demonstrations and evaluate it on a variety of bimanual manipulation tasks in a simulation benchmark, the Ego Humanoid Manipulation Benchmark, showing that it outperforms existing methods.
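To make the inverse-kinematics-and-retargeting step concrete, below is a minimal sketch, not the paper's implementation: a human wrist keypoint taken from an egocentric frame is scaled into the workspace of a toy 2-link planar arm and converted to joint angles with closed-form inverse kinematics. The link lengths, scale factor, and function names are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the authors' code):
# map a human wrist position to robot joint angles via retargeting + IK.
import numpy as np

L1, L2 = 0.30, 0.25  # assumed robot link lengths in meters


def retarget_wrist(human_xy, scale=0.6):
    """Scale a shoulder-centered human wrist position into the robot's
    workspace, clipping to the arm's maximum reach."""
    target = np.asarray(human_xy, dtype=float) * scale
    reach = L1 + L2 - 1e-3
    norm = np.linalg.norm(target)
    if norm > reach:  # keep the target reachable
        target *= reach / norm
    return target


def ik_2link(target):
    """Closed-form inverse kinematics for a planar 2-link arm:
    returns shoulder and elbow angles placing the end effector at target."""
    x, y = target
    d2 = x * x + y * y
    cos_elbow = np.clip((d2 - L1**2 - L2**2) / (2 * L1 * L2), -1.0, 1.0)
    elbow = np.arccos(cos_elbow)  # elbow-down solution
    shoulder = np.arctan2(y, x) - np.arctan2(
        L2 * np.sin(elbow), L1 + L2 * np.cos(elbow)
    )
    return shoulder, elbow


# Example: a wrist keypoint estimated from one egocentric video frame.
human_wrist = [0.45, 0.20]
robot_target = retarget_wrist(human_wrist)
q_shoulder, q_elbow = ik_2link(robot_target)
print(f"target {robot_target}, joints ({q_shoulder:.3f}, {q_elbow:.3f}) rad")
```

In the paper this conversion is done for full hand and arm motion on a bimanual humanoid; the 2-link planar arm here only illustrates the general retargeting-then-IK pattern.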
Takeaways, Limitations
•
Takeaways:
◦
Improving the efficiency of robot manipulation imitation learning through large-scale utilization of human video data.
◦
Improved generalization performance across a variety of scenes and tasks.
◦
Effective translation of human actions into robot actions through inverse kinematics and retargeting.
◦
Introducing a new simulation benchmark, the Ego Humanoid Manipulation Benchmark.
•
Limitations:
◦
Potential for reduced accuracy due to differences between human and robot behavior.
◦
Verification of generalization performance in real robot environments is needed.
◦
Generalization may be constrained by the limited scope of the Ego Humanoid Manipulation Benchmark.