Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

Created by
  • Haebom

Authors

Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, Kai Xu

Outline

This paper focuses on predictive manipulation, which uses predicted future states to improve robot policy performance. Existing world models struggle to generate accurate future visual states of robot-object interactions, particularly at the pixel level. To address this, we propose LaDi-WM, a world model that predicts future states in latent space using diffusion modeling. LaDi-WM builds on a latent space aligned with pre-trained visual foundation models (VFMs), which provides both geometric (DINO-based) and semantic (CLIP-based) features. We demonstrate that predicting how this latent space evolves facilitates learning and generalizes better than predicting pixel-level images directly. Based on LaDi-WM, we design a diffusion policy that iteratively refines its output actions by incorporating the predicted states, producing more consistent and accurate results. Extensive experiments on synthetic and real-world benchmarks show that LaDi-WM improves policy performance by 27.9% on the LIBERO-LONG benchmark and by 20% in real-world scenarios, and that it generalizes well in real-world experiments.
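The idea of refining an action by looking at a predicted future state can be illustrated with a toy loop. The sketch below is not the paper's implementation: it uses no diffusion model, DINO, or CLIP, and the names `predict_next_latent`, `refine_action`, and `plan_with_refinement` are hypothetical. It assumes a trivial "world model" in which the next latent is the current latent plus the action, and a "policy" that corrects its action by a fraction of the predicted error.

```python
from typing import List, Tuple

Vector = List[float]


def predict_next_latent(latent: Vector, action: Vector) -> Vector:
    """Toy stand-in for a world model (assumption): next latent = latent + action."""
    return [z + a for z, a in zip(latent, action)]


def refine_action(action: Vector, predicted: Vector, goal: Vector,
                  step: float = 0.5) -> Vector:
    """Toy stand-in for a policy update (assumption): move the action by a
    fraction of the gap between the goal and the predicted latent."""
    return [a + step * (g - p) for a, p, g in zip(action, predicted, goal)]


def plan_with_refinement(latent: Vector, goal: Vector,
                         num_rounds: int = 5) -> Tuple[Vector, List[float]]:
    """Alternate between predicting the outcome of the current action and
    refining the action using that prediction."""
    action = [0.0] * len(latent)  # initial guess: do nothing
    errors = []
    for _ in range(num_rounds):
        predicted = predict_next_latent(latent, action)
        # Record the largest per-dimension gap before refining.
        errors.append(max(abs(g - p) for g, p in zip(goal, predicted)))
        action = refine_action(action, predicted, goal)
    return action, errors


if __name__ == "__main__":
    action, errors = plan_with_refinement([0.0, 1.0], [1.0, -1.0])
    print(action, errors)
```

In this toy setting the predicted error halves on every round, which shows only the structure of the predict-then-refine loop, not the behavior of the learned models in the paper.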

Takeaways, Limitations

Takeaways:
We propose LaDi-WM, a world model based on latent-space prediction with diffusion modeling, and show that it facilitates learning and generalizes better than pixel-level prediction.
We show that the accuracy and consistency of robot manipulation can be improved by using a diffusion policy that utilizes predicted states.
We report policy performance gains of 27.9% on the LIBERO-LONG benchmark and 20% in real-world scenarios.
We also observe strong generalization in the real-world experiments.
Limitations:
The performance of LaDi-WM may depend on the performance of the VFM used.
It may not fully capture the complexity and diversity of real environments.
Due to limitations of the benchmark, further validation of generalization performance may be required.
Computational cost may be high: this is not explicitly stated, but diffusion models are typically expensive to run.