Predictive capability is essential for general-purpose systems that plan or act in the world at varying levels of abstraction. We find a strong correlation between the perceptual capabilities of vision models and their short-term predictive performance. This trend holds across a diverse set of pretrained models, including generatively trained ones, and across multiple levels of abstraction, from raw pixels to depth, point tracking, and object motion. To study this, we present a general-purpose prediction framework that operates on top of any frozen vision backbone: it learns a latent diffusion model to forecast future features in the backbone's representation space and decodes those predictions with lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce a distributional metric that compares predicted and observed features directly in each task's space, and we apply the framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
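As a rough illustration of the described pipeline, the sketch below (in PyTorch) keeps the vision backbone frozen, trains a small diffusion model to predict future features in the backbone's representation space, and leaves decoding to lightweight task heads. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `FeaturePredictor` and `diffusion_training_step`, the MLP denoiser, and all hyperparameters are illustrative placeholders.

```python
# Minimal, hypothetical sketch of feature-space prediction with a latent diffusion model.
# Assumes `backbone(frames)` returns a (B, D) feature tensor; all names are placeholders.
import torch
import torch.nn as nn


class FeaturePredictor(nn.Module):
    """Predicts the noise added to future backbone features, conditioned on
    past-frame context features and a diffusion timestep."""

    def __init__(self, feat_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.net = nn.Sequential(
            nn.Linear(feat_dim * 2 + hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, noisy_future, context, t):
        temb = self.time_embed(t[:, None].float())
        return self.net(torch.cat([noisy_future, context, temb], dim=-1))


def diffusion_training_step(backbone, predictor, frames_past, frames_future,
                            optimizer, num_steps: int = 1000):
    """One training step: diffuse the future features and learn to predict the noise."""
    with torch.no_grad():                       # the vision backbone stays frozen
        z_past = backbone(frames_past)          # context features, shape (B, D)
        z_future = backbone(frames_future)      # target future features, shape (B, D)

    # Sample a diffusion timestep and corrupt the future features (cosine schedule).
    t = torch.randint(0, num_steps, (z_future.shape[0],), device=z_future.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2
    noise = torch.randn_like(z_future)
    z_noisy = alpha_bar.sqrt()[:, None] * z_future + (1 - alpha_bar).sqrt()[:, None] * noise

    # Standard epsilon-prediction objective in the feature (latent) space.
    pred_noise = predictor(z_noisy, z_past, t)
    loss = torch.nn.functional.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At evaluation time, features sampled from the trained predictor would be passed through small task-specific heads (e.g. a linear probe for depth or motion), so the same predictor can be scored on each downstream task without retraining the backbone.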