While research on improving the reasoning performance of large language models with reinforcement learning and verifiable rewards has focused largely on raw model accuracy, it has overlooked a fundamental asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. In this paper, we propose PODS, which decouples rollout generation from policy updates and trains on only a strategically selected subset of the generated rollouts, dramatically reducing update cost while preserving learning quality. We introduce a principled selection criterion, max-variance down-sampling, which maximizes the reward diversity of the retained rollouts, and we provide an efficient O(n log n) implementation. Experiments show that Group Relative Policy Optimization (GRPO) with PODS reaches state-of-the-art test accuracy at least 1.7x faster than standard GRPO across a range of reasoning benchmarks and hardware configurations.
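To make the selection criterion concrete, the following is a minimal NumPy sketch of max-variance down-sampling, not a reference implementation. It assumes the structural property that a variance-maximizing subset of size m can be formed from the i highest and (m - i) lowest rewards for some split i, so a single sort plus a linear scan over splits gives the stated O(n log n) cost; the function name and interface are illustrative.

```python
import numpy as np


def max_variance_downsample(rewards: np.ndarray, m: int) -> np.ndarray:
    """Return indices of an m-element subset of `rewards` with maximal reward variance.

    Sketch under the assumption that an optimal subset is the union of the
    i highest and (m - i) lowest rewards for some i: sort once (O(n log n)),
    then evaluate every split with prefix sums (O(n)).
    """
    n = rewards.shape[0]
    assert 1 <= m <= n

    order = np.argsort(rewards)                        # indices sorted by reward, ascending
    r = rewards[order]

    ps = np.concatenate([[0.0], np.cumsum(r)])         # prefix sums of rewards
    ps2 = np.concatenate([[0.0], np.cumsum(r * r)])    # prefix sums of squared rewards

    best_var, best_i = -np.inf, 0
    for i in range(m + 1):                             # take i from the top, m - i from the bottom
        s = ps[m - i] + (ps[n] - ps[n - i])            # sum of the candidate subset
        s2 = ps2[m - i] + (ps2[n] - ps2[n - i])        # sum of squares of the candidate subset
        var = s2 / m - (s / m) ** 2
        if var > best_var:
            best_var, best_i = var, i

    # Keep the (m - best_i) lowest-reward and best_i highest-reward rollouts.
    return np.concatenate([order[: m - best_i], order[n - best_i:]])
```

As an illustrative usage (numbers are hypothetical), with 64 rollouts per prompt and a down-sampled batch of 8, `max_variance_downsample(rewards, 8)` returns the indices of the rollouts retained for the policy update, while the remaining rollouts are discarded after reward computation.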