This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
This paper proposes Manipulate in Dream (MinD), a dual-system world model for real-time, hazard-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency vision generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. The core idea is that the robot policy does not need fully denoised frames; it can rely on low-resolution latent variables generated in a single denoising step. To link these early predictions to actions, the authors introduce DiffMatcher, a video-action alignment module with a novel joint learning strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RL-Bench and a 60% success rate on the real-world Franka task while operating at 11.3 frames per second, demonstrating that single-step latent features are effective as control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. The study presents a new paradigm for efficient and reliable robot manipulation using generative world models.
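The two-rate structure described above can be illustrated with a toy scheduling sketch. This is not the authors' implementation: the class `DualRateLoop`, its methods, the `slow_period` parameter, and the arithmetic inside each method are all hypothetical placeholders, and a single-threaded loop stands in for true asynchrony. It only shows a slow predictor refreshing a latent feature every few ticks while a fast policy acts on every tick using the most recent feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DualRateLoop:
    """Toy stand-in for a slow world model feeding a fast policy (hypothetical)."""

    slow_period: int  # control ticks between latent refreshes (assumed value)

    def predict_latent(self, observation: float) -> float:
        # Placeholder for a single denoising step of the low-frequency generator.
        return 0.5 * observation

    def align(self, latent: float) -> float:
        # Placeholder for an alignment module mapping latents to policy features.
        return latent + 1.0

    def act(self, feature: float, observation: float) -> float:
        # Placeholder for the high-frequency policy.
        return feature - observation

    def run(self, observations: List[float]) -> List[float]:
        actions = []
        feature = 0.0
        for tick, obs in enumerate(observations):
            # Slow path: refresh the latent feature only every `slow_period` ticks.
            if tick % self.slow_period == 0:
                feature = self.align(self.predict_latent(obs))
            # Fast path: act on every tick with the latest (possibly stale) feature.
            actions.append(self.act(feature, obs))
        return actions


loop = DualRateLoop(slow_period=3)
actions = loop.run([2.0, 4.0, 6.0, 8.0])
print(actions)  # [0.0, -2.0, -4.0, -3.0]
```

In this sketch the feature computed at tick 0 is reused at ticks 1 and 2 and refreshed at tick 3, which is the sense in which the fast policy runs at a higher rate than the predictor.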
Takeaways, Limitations
• Takeaways:
◦ Shows that single-step latent features can support efficient real-time robot manipulation.
◦ Uses a generative model to predict risk and improve safety.
◦ Validates performance on RL-Bench and in real-robot experiments (63% and 60% success rates, respectively).
◦ Synchronizes the two diffusion models effectively through the video-action alignment module (DiffMatcher).
• Limitations:
◦ The generalization performance of the proposed model requires further study.
◦ Applicability to a wider range of environments and tasks remains to be verified.
◦ Further research is needed to handle the complexity and uncertainty of real-world environments.
◦ The parameter tuning of LoDiff and HiDiff is not explained in detail.