This paper focuses on predictive manipulation, which leverages predicted future states to improve robot policy performance. To address the challenge of generating accurate future visual states, we propose LaDi-WM, a world model that predicts future states in latent space using diffusion modeling. LaDi-WM builds on the latent space of pre-trained vision foundation models (VFMs), which comprises both geometric (DINO-based) and semantic (CLIP-based) features; predicting the evolution of this latent space is easier to learn and generalizes better than direct pixel-level image prediction. Building on LaDi-WM, we design a diffusion policy that iteratively refines its output actions by incorporating the predicted states, yielding more consistent and accurate results. Extensive experiments on synthetic and real-world benchmarks demonstrate that LaDi-WM improves policy performance by 27.9% on the LIBERO-LONG benchmark and by 20% in real-world scenarios, while generalizing well in real-world experiments.
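As a rough illustration of the idea described above, the sketch below shows how a predicted VFM latent could feed back into a diffusion policy for iterative action refinement. All module names here (DINO/CLIP encoders, the latent diffusion world model, and the policy head) are hypothetical placeholders, not the authors' implementation; the control flow is an assumption inferred from the abstract.

```python
# Minimal sketch of LaDi-WM-style iterative refinement, assuming PyTorch.
# The encoder, world-model, and policy objects are hypothetical stand-ins.
import torch


class PredictiveManipulationAgent:
    def __init__(self, dino, clip, world_model, policy, refine_steps=3):
        self.dino = dino                # frozen geometric VFM encoder (assumed)
        self.clip = clip                # frozen semantic VFM encoder (assumed)
        self.world_model = world_model  # diffusion model over VFM latents (assumed)
        self.policy = policy            # diffusion policy conditioned on latents (assumed)
        self.refine_steps = refine_steps

    @torch.no_grad()
    def act(self, image: torch.Tensor) -> torch.Tensor:
        # Encode the current observation into the combined VFM latent space.
        z = torch.cat([self.dino(image), self.clip(image)], dim=-1)
        # Initial action proposal conditioned on the current latent alone.
        action = self.policy.sample(cond=z)
        for _ in range(self.refine_steps):
            # Predict the latent of the future state this action would reach.
            z_future = self.world_model.sample(cond=torch.cat([z, action], dim=-1))
            # Re-sample the action, now conditioned on current + predicted latents.
            action = self.policy.sample(cond=torch.cat([z, z_future], dim=-1))
        return action
```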