To address the difficulty of evaluating robot control policies, the authors propose WorldGym, an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. WorldGym evaluates policies through Monte Carlo rollouts inside the video model, with a vision-language model providing rewards. Starting from only the initial frames from real robots, the authors evaluate WorldGym on a set of VLA-based real-world robot policies and show that policy success rates within WorldGym are highly correlated with real-world success rates. They further show that WorldGym preserves relative policy rankings across policy versions, sizes, and training checkpoints. Because WorldGym requires only a single starting frame, it can efficiently evaluate how well robot policies generalize to new tasks and environments.
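The evaluation loop summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout`, `estimate_success_rate`, and the toy policy, world model, and reward are all hypothetical names, a one-dimensional state stands in for video frames, and a simple threshold check stands in for the vision-language model reward.

```python
import random
from typing import Callable, List

Frame = List[float]  # hypothetical stand-in for an image frame
Action = float       # hypothetical stand-in for a robot action


def rollout(policy: Callable[[Frame], Action],
            world_model: Callable[[List[Frame], Action], Frame],
            initial_frame: Frame,
            horizon: int) -> List[Frame]:
    """Autoregressively generate frames, conditioning each step on the policy's action."""
    frames = [initial_frame]
    for _ in range(horizon):
        action = policy(frames[-1])
        frames.append(world_model(frames, action))
    return frames


def estimate_success_rate(policy, world_model, reward_fn,
                          initial_frame: Frame,
                          n_rollouts: int = 100,
                          horizon: int = 10) -> float:
    """Monte Carlo estimate: fraction of rollouts that the reward function judges successful."""
    successes = 0
    for _ in range(n_rollouts):
        frames = rollout(policy, world_model, initial_frame, horizon)
        successes += 1 if reward_fn(frames) else 0
    return successes / n_rollouts


# --- Toy stand-ins (assumptions for illustration only) ---

def make_toy_world_model(rng: random.Random, noise: float = 0.05):
    """Next state = last state + action + Gaussian noise (replaces the video model)."""
    def world_model(frames: List[Frame], action: Action) -> Frame:
        return [frames[-1][0] + action + rng.gauss(0.0, noise)]
    return world_model


def make_toy_policy(gain: float):
    """Proportional controller that moves the state toward the target value 1.0."""
    def policy(frame: Frame) -> Action:
        return gain * (1.0 - frame[0])
    return policy


def toy_reward(frames: List[Frame]) -> bool:
    """Success if the final state is within 0.1 of the target (replaces the VLM judge)."""
    return abs(frames[-1][0] - 1.0) < 0.1


if __name__ == "__main__":
    wm = make_toy_world_model(random.Random(0))
    for gain in (0.0, 0.8):
        rate = estimate_success_rate(make_toy_policy(gain), wm, toy_reward, [0.0])
        print(f"gain={gain}: estimated success rate {rate:.2f}")
```

In this sketch every rollout starts from a single initial frame, which mirrors the property the summary highlights: no simulator state or real-robot execution is needed beyond that starting observation.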