Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Created by
  • Haebom

Author

Yibin Liu, Zhixuan Liang, Zanxin Chen, Tianxing Chen, Mengkang Hu, Wanxi Dong, Congsheng Xu, Zhaoming Han, Yusen Qin, Yao Mu

Outline

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents, yet most existing systems lack effective mechanisms for adaptively monitoring policy execution and repairing code during task completion. This paper introduces HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Given a natural language instruction, the system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. While the program executes in simulation, a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer their causes. By fusing structured execution traces, which capture program-level events, with VLM-based perceptual feedback, HyCodePolicy diagnoses the cause of a failure and repairs the program. This hybrid dual-feedback mechanism enables self-correcting program synthesis with minimal human supervision. Experimental results show that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, providing a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
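The closed-loop cycle described above can be pictured as a small control skeleton. The sketch below is illustrative only: the helper roles (decompose, synthesize, execute, diagnose, repair) and the Trace type are hypothetical stand-ins for the paper's components, not the authors' actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trace:
    """Structured execution trace capturing program-level events."""
    success: bool
    events: List[str] = field(default_factory=list)

def hycodepolicy_loop(
    instruction: str,
    decompose: Callable[[str], List[str]],   # LLM: instruction -> subgoals
    synthesize: Callable[[List[str]], str],  # subgoals -> executable program
    execute: Callable[[str], Trace],         # run in simulation, log events
    diagnose: Callable[[Trace], str],        # VLM + trace -> failure cause
    repair: Callable[[str, str], str],       # (program, diagnosis) -> program
    max_repairs: int = 3,
) -> str:
    """Synthesize a program, then monitor and repair it in a closed loop."""
    program = synthesize(decompose(instruction))
    for _ in range(max_repairs):
        trace = execute(program)
        if trace.success:
            break
        program = repair(program, diagnose(trace))
    return program
```

Each callable is injected so the skeleton stays agnostic to the particular LLM, simulator, or VLM backing it; the paper's contribution lies in how the diagnose step fuses the two feedback streams.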

Takeaways, Limitations

Takeaways:
Presents HyCodePolicy, a novel framework that leverages multimodal reasoning to improve the robustness and sample efficiency of robot manipulation policies.
Implements a closed-loop programming cycle that integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair.
Enables self-correcting program synthesis through a hybrid dual-feedback mechanism that combines VLM-based perceptual feedback with program-level event traces (see the fusion sketch after this list).
Provides a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
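To make the dual-feedback idea concrete, here is one way the two evidence streams could be fused. The Diagnosis type, the prompt wording, and the injected llm callable are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Diagnosis:
    failed_step: Optional[str]  # program-level event flagged as a failure
    cause: str                  # natural-language explanation of the failure

def fuse_feedback(
    events: List[str],          # symbolic: structured execution trace
    vlm_findings: List[str],    # perceptual: VLM checkpoint observations
    llm: Callable[[str], str],  # injected language model for reasoning
) -> Diagnosis:
    """Combine program-level events with VLM observations into one diagnosis."""
    failed = next((e for e in events if e.startswith("FAIL")), None)
    prompt = (
        "Execution events:\n" + "\n".join(events)
        + "\n\nVisual findings:\n" + "\n".join(vlm_findings)
        + "\n\nExplain the most likely cause of failure."
    )
    return Diagnosis(failed_step=failed, cause=llm(prompt))
```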
Limitations:
HyCodePolicy's performance may depend on the capabilities of the underlying VLM and other components.
Its ability to handle complex or unanticipated failure situations may be limited.
Performance in simulation does not guarantee generalization to real-world environments.
Additional constraints and issues that arise when deploying on physical robotic systems remain to be addressed.