This paper presents a test-time scaling technique for improving the robustness of Vision-Language-Action (VLA) models in unstructured real-world environments. We study how test-time sampling and verification can improve the robustness and generalization of VLAs, and show that the relationship between action error and the number of generated samples follows an exponentiated power law. Building on this insight, we propose RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a set of candidate actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision-Language-Model (VLM)-based verifier to select the optimal action. We train the VLM-based action verifier via a synthetic data generation pipeline and demonstrate, through both simulation and hardware experiments, that pairing existing VLAs with RoboMonkey yields significant gains: a 25% absolute improvement on out-of-distribution tasks and a 9% improvement on in-distribution tasks. Furthermore, when adapting to a new robot setup, fine-tuning both the VLA and the action verifier improves performance by 7% compared to fine-tuning the VLA alone.
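
The abstract only summarizes the sample-perturb-vote-verify loop at a high level; the sketch below makes the control flow concrete. This is a minimal illustration, not the paper's implementation: the callables `vla_sample` and `verifier_score`, the hyperparameter values, and the nearest-neighbour averaging used here to approximate majority voting over continuous actions are all assumptions for the sake of the example.

```python
import numpy as np

def robomonkey_select_action(vla_sample, verifier_score, image, instruction,
                             n_samples=16, n_proposals=8, noise_std=0.01,
                             rng=None):
    """Test-time scaling sketch: sample, perturb, vote, then verify.

    `vla_sample` and `verifier_score` are hypothetical callables standing in
    for the VLA's action head and the VLM-based action verifier.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # 1. Draw a batch of candidate actions from the VLA policy.
    actions = np.stack([vla_sample(image, instruction)
                        for _ in range(n_samples)])

    # 2. Gaussian perturbation: jitter the candidates to densify the
    #    neighbourhood around the policy's modes.
    perturbed = actions + rng.normal(0.0, noise_std, size=actions.shape)

    # 3. "Majority voting" over continuous actions, approximated here by
    #    averaging each candidate's nearest neighbours, so the proposal
    #    distribution concentrates on consensus actions.
    proposals = []
    for a in perturbed[:n_proposals]:
        dists = np.linalg.norm(perturbed - a, axis=1)
        neighbours = perturbed[np.argsort(dists)[: n_samples // 2]]
        proposals.append(neighbours.mean(axis=0))

    # 4. Score each proposal with the verifier and execute the argmax.
    scores = [verifier_score(image, instruction, p) for p in proposals]
    return proposals[int(np.argmax(scores))]

if __name__ == "__main__":
    # Stub policy and verifier so the sketch runs end to end (7-DoF action).
    policy = lambda img, txt: np.random.default_rng().normal(size=7)
    scorer = lambda img, txt, a: -np.linalg.norm(a)  # toy score: small actions
    print(robomonkey_select_action(policy, scorer,
                                   image=None, instruction="pick up the cup"))
```

In the actual framework the verifier is a fine-tuned VLM rather than a toy scoring function, and the number of samples trades compute for accuracy following the scaling relationship described above.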