This paper focuses on action-based decision-making in open-world environments and presents "Act from Visual Language Post-Training," a novel approach for improving the performance of a Visual Language Action (VLA) model whose backbone Visual Language Model (VLM) is pre-trained on large-scale web data. Unlike previous studies that focus primarily on action post-training, our approach enhances the underlying VLM itself through self-supervised post-training with visual and linguistic guidance, improving its world knowledge, visual recognition, and spatial understanding in open-world environments. Building on this post-training paradigm, we present the first VLA model in the Minecraft environment that can follow human instructions across over 1,000 distinct atomic tasks, including crafting, smelting, cooking, mining, and killing. Post-training on non-trajectory tasks yields a 40% improvement over the best existing agent baseline on these atomic tasks. Furthermore, our approach surpasses existing imitation learning-based policies, achieving state-of-the-art performance. We make the code, models, and datasets publicly available to encourage further research.