In this paper, we propose VIPER, a novel framework for visually guided planning. VIPER integrates perception based on a Vision-Language Model (VLM) with inference based on a Large Language Model (LLM). It uses a modular pipeline in which the VLM generates textual descriptions of image observations, and the LLM policy then predicts actions conditioned on the task objective. We fine-tune the inference module with action replication and reinforcement learning to enhance the agent's decision-making ability. Experimental results on the ALFWorld benchmark demonstrate that VIPER significantly outperforms state-of-the-art visually guided planning methods and narrows the performance gap with purely text-based oracles. By leveraging text as an intermediate representation, we enhance explainability and enable a detailed analysis of the perception and inference components.
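To make the modular pipeline concrete, the following is a minimal sketch of the perception-then-inference loop described above: a VLM maps an image observation to text, and an LLM policy maps that text plus the task objective to an action. All class and method names (`PerceptionModule`, `InferenceModule`, `ViperAgent`, `describe`, `act`) are illustrative assumptions, not the actual VIPER implementation or API.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical interfaces illustrating the modular VLM -> text -> LLM pipeline.
class PerceptionModule(Protocol):
    def describe(self, image) -> str: ...        # VLM: image -> textual description

class InferenceModule(Protocol):
    def act(self, description: str, objective: str) -> str: ...  # LLM policy: text + goal -> action

@dataclass
class ViperAgent:
    perception: PerceptionModule
    inference: InferenceModule

    def step(self, image, objective: str) -> str:
        # Text serves as the intermediate, inspectable representation
        # between the perception and inference components.
        description = self.perception.describe(image)
        return self.inference.act(description, objective)

# Dummy stand-ins so the sketch runs end to end (placeholders, not real models).
class EchoVLM:
    def describe(self, image) -> str:
        return f"a scene containing {image}"

class RuleLLM:
    def act(self, description: str, objective: str) -> str:
        return f"inspect the object mentioned in '{description}' to achieve '{objective}'"

if __name__ == "__main__":
    agent = ViperAgent(perception=EchoVLM(), inference=RuleLLM())
    print(agent.step("a mug on a table", "put the mug in the sink"))
```

Because the intermediate representation is plain text, the perception and inference modules can be evaluated or swapped independently, which is what enables the component-level analysis mentioned above.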