Large vision-language models (LVLMs) show promise for embodied planning tasks, but they struggle in complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that decouples instructions from environmental context, leaving models struggling with context-sensitive instructions and dependent on auxiliary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that instills comprehensive environmental understanding into LVLMs through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding), while developing and evaluating models via curriculum learning using only raw visual observations. Evaluation on the EB-ALFRED benchmark demonstrates substantial gains in task success rates, with Qwen2.5-VL achieving a 60.7-point absolute improvement overall, most notably in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, the enhanced open-source model outperforms proprietary systems such as GPT-4o and Claude-3.5-Sonnet by a large margin.
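To make the curriculum-learning idea concrete, the sketch below stages fine-tuning over the four cognitive capabilities, from concrete perception to abstract grounding. It is a minimal illustration under stated assumptions: the names `CAPABILITY_STAGES`, `augment_with_narrative`, and `finetune_lvlm` are hypothetical placeholders, not the authors' released API.

```python
# Hypothetical sketch of curriculum-ordered narrative enhancement; names are
# illustrative placeholders, not the paper's actual implementation.
from typing import Callable

# Four cognitive capabilities, ordered from concrete perception to abstract reasoning.
CAPABILITY_STAGES = [
    "visual_appearance",        # e.g., object colors, shapes, textures
    "spatial_reasoning",        # e.g., relative positions, reachability
    "functional_abstraction",   # e.g., what an object can be used for
    "syntactic_grounding",      # e.g., linking instruction phrases to scene entities
]

def run_curriculum(
    raw_episodes: list[dict],
    augment_with_narrative: Callable[[dict, str], dict],
    finetune_lvlm: Callable[[list[dict]], None],
) -> None:
    """Fine-tune the planner stage by stage, enriching instructions with
    environment-grounded narratives of increasing abstraction. The model's
    test-time input remains raw visual observations; the narratives only
    shape the training supervision."""
    for stage in CAPABILITY_STAGES:
        staged_data = [augment_with_narrative(ep, stage) for ep in raw_episodes]
        finetune_lvlm(staged_data)
```

The design choice illustrated here is the ordering: each stage builds on the previous one, so the planner acquires grounded perception before being asked to reason about function and instruction-to-scene grounding.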