This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Humans think slowly and deliberately before acting when they carry out complex tasks in the physical world. Inspired by this, the paper aims to give robot control models a comparable capacity for deliberate thinking. It proposes Hume, a dual-system vision-language-action (VLA) model that combines value-guided System 2 thinking with cascaded action denoising. System 2 uses a novel value-query head to estimate the state-action value of predicted actions and performs value-guided thinking: it repeatedly samples multiple action candidates and selects one according to its estimated state-action value. System 1 is a lightweight, reactive visuomotor policy that takes the action selected by System 2 and applies cascaded action denoising to produce dexterous robot control. During deployment, System 2 runs value-guided thinking at a low frequency, while System 1 asynchronously receives the action selected by System 2 and predicts fluid actions in real time. Experiments show that Hume outperforms existing state-of-the-art VLA models on several simulation benchmarks and in real-robot deployments.
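The value-guided selection step described above (sample several action candidates, score each with a state-action value estimate, keep the best) can be sketched as a simple best-of-N routine. This is a minimal illustration and not the paper's implementation: `value_guided_select`, `toy_sampler`, and `toy_q` are hypothetical names, and the toy sampler and value function stand in for Hume's learned action head and value-query head.

```python
import random
from typing import Callable, Sequence, Tuple

Action = Tuple[float, ...]


def value_guided_select(
    state: Sequence[float],
    sample_action: Callable[[Sequence[float], random.Random], Action],
    q_value: Callable[[Sequence[float], Action], float],
    num_candidates: int = 8,
    seed: int = 0,
) -> Tuple[Action, float]:
    """Sample several candidate actions and return the one with the
    highest estimated state-action value, together with that value."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    # Step 1: draw multiple candidate actions for the current state.
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    # Step 2: score every candidate with the value estimate.
    scored = [(q_value(state, action), action) for action in candidates]
    # Step 3: keep the candidate with the highest estimated value.
    best_value, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_value


def toy_sampler(state: Sequence[float], rng: random.Random) -> Action:
    """Hypothetical stand-in for a learned policy: Gaussian noise around the state."""
    return tuple(s + rng.gauss(0.0, 1.0) for s in state)


def toy_q(state: Sequence[float], action: Action) -> float:
    """Hypothetical stand-in for a learned value head: prefers actions near a fixed target."""
    target = (1.0, -1.0)
    return -sum((a - t) ** 2 for a, t in zip(action, target))


if __name__ == "__main__":
    action, value = value_guided_select((0.0, 0.0), toy_sampler, toy_q)
    print("selected action:", action, "estimated value:", value)
```

Increasing `num_candidates` trades extra System 2 computation for a better chance of finding a high-value action, which is one reason such a selection step would run at a lower frequency than the reactive policy.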
Takeaways, Limitations
• Takeaways:
◦ Applies human-like slow, deliberate thinking to robot control, improving performance on complex tasks.
◦ Value-guided thinking enables efficient action selection and planning.
◦ The dual-system architecture balances real-time responsiveness (System 1) against planning capability (System 2).
◦ Strong results are reported on multiple simulation benchmarks and in real-robot deployments.
• Limitations:
◦ Further research is needed on how effectively the model learns its value function and how well that value function generalizes.
◦ Generalization still needs to be evaluated across more diverse and complex task environments.
◦ The model may not fully handle the complexity and uncertainty of the real world.
◦ Running System 2 at a low frequency may introduce delays.
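The delay concern in the last limitation can be made concrete with a small calculation. The sketch below assumes a simplified, hypothetical timing model that is not taken from the paper: System 1 acts at every control step, System 2 publishes a newly selected action once every `system2_period` steps, and System 2's own compute time is ignored. The function name `plan_staleness` is illustrative.

```python
from typing import List


def plan_staleness(num_steps: int, system2_period: int) -> List[int]:
    """For each System 1 control step, return how many steps old the
    most recent System 2 selection is (simplified timing model)."""
    if system2_period < 1:
        raise ValueError("system2_period must be at least 1")
    staleness = []
    last_update = 0
    for step in range(num_steps):
        if step % system2_period == 0:
            # System 2 publishes a freshly selected action at this step.
            last_update = step
        # System 1 acts using whatever System 2 published most recently.
        staleness.append(step - last_update)
    return staleness


if __name__ == "__main__":
    print(plan_staleness(6, 3))  # [0, 1, 2, 0, 1, 2]
```

Under this model the guidance System 1 works from is at most `system2_period - 1` steps old, so a slower System 2 directly widens the window in which the robot acts on outdated guidance.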