Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis

Created by
  • Haebom

Authors

Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo

Outline

This paper proposes Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD runs two asynchronous diffusion processes: a low-frequency vision generator (LoDiff) that predicts future scenes, and a high-frequency diffusion policy (HiDiff) that outputs actions. The core insight is that the robot policy can rely on the low-resolution latents produced by a single denoising step, rather than waiting for fully denoised frames. To link these early predictions to actions, the authors introduce DiffMatcher, a video-action alignment module trained with a novel joint learning strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RLBench and a 60% success rate on real-world Franka manipulation tasks while operating at 11.3 frames per second, demonstrating that single-step latent features are effective control signals. Furthermore, MinD flags 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work presents a new paradigm for efficient, trustworthy robot manipulation with generative world models.
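Since the outline describes the dual-frequency design only at a high level, here is a minimal sketch of how such an asynchronous loop might be structured. Everything except the names LoDiff and HiDiff and the single-step-latent idea is an assumption for illustration: the MLP stand-ins, tensor sizes, update ratio K, and denoising rules are toy placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the paper).
OBS_DIM, LATENT_DIM, ACTION_DIM = 128, 64, 7

class LoDiff(nn.Module):
    """Toy stand-in for the low-frequency video diffusion model."""
    def __init__(self):
        super().__init__()
        self.denoise = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT_DIM, 256), nn.SiLU(),
            nn.Linear(256, LATENT_DIM))

    def single_step_latent(self, obs, noisy_latent):
        # One denoising step on a noisy future-scene latent. The result is
        # coarse, but per the paper's core idea it is already informative
        # enough to condition the policy on.
        return self.denoise(torch.cat([obs, noisy_latent], dim=-1))

class HiDiff(nn.Module):
    """Toy stand-in for the high-frequency diffusion policy."""
    def __init__(self, steps=4):
        super().__init__()
        self.steps = steps
        self.denoise = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 256), nn.SiLU(),
            nn.Linear(256, ACTION_DIM))

    def sample_action(self, latent):
        # Short action-denoising chain conditioned on the world-model latent.
        a = torch.randn(latent.shape[0], ACTION_DIM)
        for _ in range(self.steps):
            a = a - self.denoise(torch.cat([latent, a], dim=-1))
        return a

lodiff, hidiff = LoDiff(), HiDiff()
K = 4  # policy ticks per world-model refresh (illustrative ratio)

obs = torch.randn(1, OBS_DIM)        # in practice: a fresh camera encoding
latent = torch.zeros(1, LATENT_DIM)
for t in range(12):
    if t % K == 0:
        # Low-frequency update: a SINGLE denoising step, not a full chain,
        # so the policy never waits for a fully denoised future frame.
        latent = lodiff.single_step_latent(obs, torch.randn(1, LATENT_DIM))
    action = hidiff.sample_action(latent)  # high-frequency control output
```

The point the sketch makes is the decoupling: the expensive world model is queried only once every K control ticks, and only for a single denoising step, which is what allows the full system to reach real-time rates such as the reported 11.3 FPS.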

Takeaways, Limitations

Takeaways:
  • Demonstrates that single-step latent features can support efficient real-time robot manipulation.
  • Uses a generative world model for risk prediction and improved safety.
  • Validates performance on RLBench and real-robot experiments, achieving high success rates.
  • Effectively synchronizes the two diffusion models via the video-action alignment module (DiffMatcher); one possible alignment objective is sketched below, after the Limitations list.
Limitations:
  • Further research is needed on the generalization performance of the proposed model.
  • Applicability to a wider range of environments and tasks remains to be verified.
  • Handling the complexity and uncertainty of real-world environments requires further work.
  • Detailed explanation of LoDiff and HiDiff hyperparameter tuning is lacking.
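The summary names DiffMatcher as a video-action alignment module with a joint learning strategy but gives no further detail, so the following is a hypothetical sketch of one plausible alignment objective: a temperature-scaled contrastive loss between projected single-step video latents and encoded action chunks. The projection heads, chunk length, and loss choice are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, ACTION_DIM, CHUNK, EMB = 64, 7, 8, 64  # illustrative sizes

video_proj = nn.Linear(LATENT_DIM, EMB)          # projects LoDiff latents
action_enc = nn.Linear(ACTION_DIM * CHUNK, EMB)  # encodes an action chunk

def alignment_loss(video_latents, action_chunks, temperature=0.07):
    """Contrastive loss pulling matched (latent, action-chunk) pairs
    together within a batch and pushing mismatched pairs apart."""
    v = F.normalize(video_proj(video_latents), dim=-1)
    a = F.normalize(action_enc(action_chunks.flatten(1)), dim=-1)
    logits = (v @ a.T) / temperature     # batch-wise similarity matrix
    targets = torch.arange(v.shape[0])   # diagonal entries are the matches
    return F.cross_entropy(logits, targets)

# Toy batch: 16 matched pairs of single-step latents and action chunks.
loss = alignment_loss(torch.randn(16, LATENT_DIM),
                      torch.randn(16, CHUNK, ACTION_DIM))
loss.backward()  # gradients flow into both alignment heads
```

Trained jointly with the two diffusion losses, an objective like this would keep the single-step video latents predictive of the actions the policy must produce, which is the synchronization role the summary attributes to DiffMatcher.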