Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Created by
  • Haebom

Authors

Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, Xuelong Li

Outline

Inspired by the way humans think slowly before acting on complex tasks in the physical world, this paper aims to give robotic models a similar capability. It proposes Hume, a dual-system vision-language-action (VLA) model that combines value-guided System 2 thinking with cascaded action denoising. System 2 uses a novel value query head to estimate the state-action value of predicted actions, and performs value-guided thinking by repeatedly sampling multiple action candidates and selecting one according to its state-action value. System 1 is a lightweight reactive visuomotor policy that takes the action selected by System 2 and applies cascaded action denoising for dexterous robot control. At deployment, System 2 runs value-guided thinking at a low frequency, while System 1 asynchronously receives the action candidate selected by System 2 and predicts fluid actions in real time. According to the reported experiments, Hume outperforms existing state-of-the-art VLA models on several simulation benchmarks and in real-robot deployments.
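The sample-then-select step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_value` and `toy_sampler` are hypothetical stand-ins for the learned value query head and the policy's candidate sampler, and the state and action shapes are assumptions chosen only to make the example runnable.

```python
import random
from typing import Callable, List, Sequence

Action = List[float]
State = List[float]


def select_by_value(
    state: State,
    candidates: Sequence[Action],
    value_fn: Callable[[State, Action], float],
) -> Action:
    """Return the candidate with the highest estimated state-action value."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=lambda action: value_fn(state, action))


def toy_value(state: State, action: Action) -> float:
    """Hypothetical stand-in for a learned value head.

    Scores an action by its negative squared distance to the state,
    so actions closer to the state receive higher values.
    """
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def toy_sampler(rng: random.Random, dim: int) -> Action:
    """Hypothetical stand-in for the policy's candidate sampler."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    state = [0.2, -0.4]
    # Sample several candidates, then keep the one the value function prefers.
    candidates = [toy_sampler(rng, len(state)) for _ in range(8)]
    best = select_by_value(state, candidates, toy_value)
    print("selected action:", best)
```

In this sketch the number of candidates (8) is arbitrary. Sampling more candidates gives the value function more options to rank, at the cost of more computation per thinking step.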

Takeaways, Limitations

Takeaways:
Applies human-style slow thinking to robot control to improve performance on complex tasks.
Value-guided thinking supports action selection and planning by ranking sampled action candidates by their estimated state-action value.
The dual-system architecture balances real-time responsiveness (System 1) with deliberate planning (System 2).
Reports strong results on several simulation benchmarks and in real-robot deployments.
Limitations:
Further research is needed on how effectively the proposed model learns its value function and how well that function generalizes.
Generalization should be evaluated further across more diverse and complex task environments.
The model may not fully handle the complexity and uncertainty of the real world.
Because System 2 operates at a low frequency, its guidance may reach System 1 with a delay.
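The delay concern in the last limitation can be made concrete with a small single-threaded simulation. This is only a sketch under stated assumptions: the summary does not give the actual control rates, so the step counts below are hypothetical, and the real system is described as asynchronous rather than lock-stepped as it is here.

```python
from typing import List


def staleness_per_step(total_steps: int, system2_period: int) -> List[int]:
    """For each fast (System 1) step, count the steps since the last slow (System 2) update.

    Assumes System 2 refreshes its selected action every `system2_period`
    System 1 steps, starting at step 0.
    """
    if system2_period < 1:
        raise ValueError("system2_period must be at least 1")
    return [step % system2_period for step in range(total_steps)]


if __name__ == "__main__":
    # Hypothetical rates: System 2 updates once every 4 System 1 steps.
    ages = staleness_per_step(total_steps=8, system2_period=4)
    print("guidance age per step:", ages)
    print("worst-case age in steps:", max(ages))
```

Under these assumptions the guidance used by System 1 is at most `system2_period - 1` steps old, so a slower System 2 directly increases how outdated the selected action can be.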