This paper presents the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, VLWM first infers the overall goal achievement and then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted through iterative LLM self-refinement conditioned on compressed future observations represented as a Tree of Captions. VLWM learns both an action policy and a dynamics model, which respectively enable reactive System-1 plan decoding and reflective System-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states produced by VLWM roll-outs and the expected goal state, and is measured by a critic model trained in a self-supervised manner. VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluation, where System-2 improves the Elo score by 27% over System-1. VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
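To make the System-2 procedure concrete, below is a minimal sketch of planning by cost minimization under the assumptions stated in the abstract: the policy decodes candidate action sequences (System-1 style), the dynamics model rolls each one forward into a hypothetical future state expressed in language, and the critic scores the semantic distance of that state to the goal; the lowest-cost candidate is returned. All names and interfaces here (`sample_actions`, `rollout`, `critic.cost`, `Candidate`) are hypothetical placeholders for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    actions: List[str]          # candidate action sequence proposed by the policy
    predicted_state: str        # hypothetical future world state from the roll-out


def system2_plan(world_model, critic, observation: str, goal: str,
                 num_candidates: int = 8) -> Candidate:
    """Select the candidate plan whose predicted end state the critic
    judges semantically closest to the goal state (lowest cost)."""
    candidates = []
    for _ in range(num_candidates):
        # System-1: decode one candidate action sequence from the learned policy.
        actions = world_model.sample_actions(observation, goal)
        # Dynamics model: roll the plan forward into a language-based future state.
        predicted_state = world_model.rollout(observation, actions)
        candidates.append(Candidate(actions, predicted_state))
    # System-2: minimize the critic's semantic-distance cost to the goal state.
    return min(candidates, key=lambda c: critic.cost(c.predicted_state, goal))
```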