This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Created by
Haebom
Author
Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn
Outline
This paper presents a general-purpose robotic system that performs a wide range of tasks in open-ended environments. The system processes complex instructions, prompts, and feedback, and plans tasks step by step. A hierarchical vision-language model parses complex commands and user feedback, infers the most appropriate next step, and then executes that step as a low-level action. Unlike systems limited to direct execution of simple commands ("Pick up the cup"), it can follow complex prompts and incorporate context-sensitive feedback ("That's not trash") during task execution. The system is evaluated on tasks such as table clearing, sandwich making, and grocery shopping across three robotic platforms: a single-arm robot, a dual-arm robot, and a dual-arm mobile robot.
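The two-level loop described above can be sketched in code. This is a minimal illustration, not the paper's implementation: both model calls (`high_level_vlm`, `low_level_policy`) are hypothetical stand-ins for the actual trained models, and the action format is invented for clarity.

```python
def high_level_vlm(observation, instruction, feedback):
    """Stand-in for the high-level vision-language model (hypothetical).

    Infers the next atomic subtask as a short language command,
    revising the plan when user feedback arrives mid-task.
    """
    if feedback:  # e.g. "That's not trash" -> adjust the current plan
        return f"undo last action; {feedback}"
    return f"pick up the next item for: {instruction}"


def low_level_policy(observation, subtask):
    """Stand-in for the low-level vision-language-action policy (hypothetical).

    Maps (observation, language subtask) to a chunk of robot actions;
    the action strings here are placeholders.
    """
    return {"subtask": subtask, "actions": ["reach", "grasp", "place"]}


def control_loop(instruction, observations, feedback_stream):
    """Hierarchical control: re-infer the subtask at the high level each
    step, then execute it with the low-level policy."""
    log = []
    for obs, fb in zip(observations, feedback_stream):
        subtask = high_level_vlm(obs, instruction, fb)
        log.append(low_level_policy(obs, subtask))
    return log


log = control_loop(
    "clear the table",
    observations=["frame0", "frame1"],   # camera images in the real system
    feedback_stream=[None, "that's not trash"],
)
```

The key design point this sketch mirrors is that user feedback is handled at the language level by the high-level model, so the low-level policy only ever sees simple atomic commands.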
Takeaways, Limitations
•
Takeaways:
◦
Demonstrates the potential to build robotic systems that can process complex language commands and contextual feedback.
◦
Applicability experimentally validated across multiple robot platforms (single-arm, dual-arm, and dual-arm mobile).
◦
Achieves efficient task execution through hierarchical use of vision-language models.
•
Limitations:
◦
Further analysis is needed on the generalizability and robustness of the system presented in the paper.
◦
Scalability to a wider range of environments and tasks remains to be verified.
◦
Further research is needed on handling unexpected situations and recovering from errors.