This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Created by
Haebom
Authors
Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Outline
This paper proposes a method for training a Vision-Language-Action (VLA) model on egocentric human videos, addressing the difficulty of collecting real robot data at scale for imitation learning in robot manipulation. The VLA model is first trained on human video data, which is rich in scene and task information, and human actions are converted into robot actions through inverse kinematics and retargeting. The model is then fine-tuned on a small number of robot manipulation demonstrations to obtain a robot policy, EgoVLA, which is evaluated on a simulation benchmark, the Ego Humanoid Manipulation Benchmark, comprising diverse bimanual manipulation tasks. The results show improved performance over existing methods, demonstrating the importance of human data.
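To make the action-conversion step concrete, below is a minimal, self-contained sketch of how a human wrist position and hand state might be mapped to robot commands via inverse kinematics and retargeting. It uses a toy 2-link planar arm and a linear gripper mapping; the arm model, function names, and constants are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of human-to-robot action conversion: solve inverse
# kinematics for a toy 2-link planar arm to reach the tracked human wrist
# position, and linearly retarget the human thumb-index distance to a robot
# gripper width. Illustrative only; not the paper's implementation.
import numpy as np

L1, L2 = 0.3, 0.25  # link lengths of the toy arm (meters), assumed values

def forward_kinematics(q):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian of the planar 2-link arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def solve_ik(target_xy, q_init, iters=100, damping=1e-2):
    """Damped least-squares IK: joint angles that reach target_xy."""
    q = q_init.copy()
    for _ in range(iters):
        err = target_xy - forward_kinematics(q)
        if np.linalg.norm(err) < 1e-4:
            break
        J = jacobian(q)
        # Levenberg-Marquardt-style update: (J^T J + lambda I)^-1 J^T err
        q += np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
    return q

def retarget_gripper(thumb_index_dist, human_max=0.10, robot_max=0.08):
    """Map human thumb-index distance to a robot gripper opening width."""
    return np.clip(thumb_index_dist / human_max, 0.0, 1.0) * robot_max

# Example: one frame of a human demonstration from hand tracking.
human_wrist_xy = np.array([0.35, 0.20])  # tracked wrist position (meters)
human_pinch = 0.04                       # thumb-index distance (meters)
q = solve_ik(human_wrist_xy, q_init=np.array([0.1, 0.1]))
width = retarget_gripper(human_pinch)
print("joint targets:", q, "gripper width:", width)
```

In the paper's setting, the same idea would apply per frame to full wrist poses and dexterous hands, with IK solved for the robot arm and human hand keypoints retargeted to robot hand joints.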
Takeaways, Limitations
•
Takeaways:
◦
Presents a strategy for leveraging large-scale human video data to overcome the limitations of collecting real robot data.
◦
Presents an effective method for converting human actions into robot actions (inverse kinematics and retargeting).
◦
Introduces the Ego Humanoid Manipulation Benchmark, a new simulation benchmark covering diverse bimanual manipulation tasks.
◦
Demonstrates improved performance over existing methods, underscoring the importance of human data.
•
Limitations:
◦
Because the evaluation was conducted in simulation, performance on real robots still requires verification.
◦
Performance may degrade due to the gap between human and robot motion.
◦
Further research is needed on the generalizability of the Ego Humanoid Manipulation Benchmark.