To address the key bottleneck of reliance on large-scale embodied interaction data, this paper proposes Primitive Embodied World Models (PEWM), a world-modeling paradigm that restricts video generation to fixed, short time horizons. This constraint enables fine-grained alignment between linguistic concepts and the visual representations of robot motions, reduces training complexity, improves the data efficiency of embodied data collection, and lowers inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM supports flexible closed-loop control and the compositional generalization of primitive-level policies to complex tasks. By combining the spatiotemporal visual priors of video models with the semantic understanding of VLMs, it bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.