This paper focuses on predictive manipulation, which leverages predicted future states to improve robot policy performance. To address the difficulty existing world models face in accurately generating future visual states of robot-object interactions, particularly at the pixel level, we propose LaDi-WM, a world model that predicts future states in latent space using diffusion modeling. LaDi-WM incorporates both geometric (DINO-based) and semantic (CLIP-based) features by leveraging pretrained visual foundation models (VFMs) with aligned latent spaces. We demonstrate that predicting latent-space dynamics is easier to learn and generalizes better than direct pixel-level image prediction. Building on LaDi-WM, we design a diffusion policy that iteratively refines its output actions by incorporating predicted states, yielding more consistent and accurate behavior. Extensive experiments on synthetic and real-world benchmarks demonstrate that LaDi-WM improves policy performance by 27.9% on the LIBERO-LONG benchmark and by 20% in real-world scenarios, where it also achieves strong generalization.
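
As a rough illustration of the pipeline the abstract describes, the sketch below mocks the inference loop with placeholder functions: encode the observation into aligned geometric and semantic latents, roll the world model forward in latent space, and let the policy iteratively refine its action conditioned on the predicted latents. All names, dimensions, and dynamics here are hypothetical stand-ins for the paper's DINO/CLIP encoders and diffusion networks, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the pretrained encoders and learned models;
# the real system uses DINO/CLIP features and diffusion networks.
def encode_obs(image):
    """Map an observation to aligned geometric + semantic latents."""
    geo = rng.standard_normal(256)   # DINO-like geometric feature (placeholder)
    sem = rng.standard_normal(256)   # CLIP-like semantic feature (placeholder)
    return np.concatenate([geo, sem])

def world_model_predict(latent, horizon=4):
    """Latent diffusion world model: predict future latents (toy dynamics)."""
    return [latent + 0.1 * rng.standard_normal(latent.shape) for _ in range(horizon)]

def policy_step(latent, predicted_latents, action):
    """Diffusion policy: refine the current action using predicted states."""
    context = np.concatenate([latent, np.mean(predicted_latents, axis=0)])
    return 0.5 * action + 0.01 * context[: action.shape[0]]  # toy refinement

# Iterative inference loop: predict future latents, refine the action, repeat.
image = np.zeros((224, 224, 3))
action = np.zeros(7)                 # e.g., a 7-DoF arm command (assumed)
latent = encode_obs(image)
for _ in range(3):                   # policy refinement iterations
    future = world_model_predict(latent)
    action = policy_step(latent, future, action)
print(action)
```

The key structural point the sketch conveys is that prediction and control share one latent space: the world model rolls dynamics forward in VFM feature space rather than pixel space, and the policy consumes those predicted latents at each refinement step.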