Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Created by
  • Haebom

Authors

Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter

PODS (Policy Optimization with Down-Sampling)

Outline

Research on improving the reasoning performance of large language models with reinforcement learning and verifiable rewards has largely focused on algorithmic performance, overlooking a fundamental asymmetry in the training pipeline: rollout generation is embarrassingly parallel and memory-light, while policy updates are communication-heavy and memory-intensive. This paper proposes PODS, which decouples rollout generation from policy updates and trains on only a strategically selected subset of the generated rollouts, dramatically reducing update cost while preserving learning quality. The authors propose max-variance down-sampling, a principled subset-selection criterion that maximizes reward diversity within the selected subset, and provide an efficient O(n log n) implementation. Experiments show that Group Relative Policy Optimization (GRPO) with PODS reaches state-of-the-art test accuracy at least 1.7x faster than standard GRPO across a variety of reasoning benchmarks and hardware configurations.
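To make the selection step concrete, here is a minimal Python sketch of max-variance down-sampling. It assumes (consistent with the O(n log n) bound, though not spelled out in this summary) that some variance-maximizing subset consists of the i highest- and (m - i) lowest-reward rollouts for some i, so a single sort plus a prefix-sum scan over the m + 1 candidate splits suffices. The function name and interface are hypothetical, not the authors' reference implementation:

```python
import numpy as np

def max_variance_downsample(rewards, m):
    """Pick m of n rollouts (m <= n) whose rewards have maximal variance.

    Sketch under the assumption that an optimal subset is made up of the
    i highest- and (m - i) lowest-reward rollouts for some i: sort once
    (O(n log n)), then evaluate all m + 1 splits in O(1) each via prefix sums.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)
    order = np.argsort(rewards)                        # ascending sort, O(n log n)
    r = rewards[order]
    pre = np.concatenate(([0.0], np.cumsum(r)))        # prefix sums of rewards
    pre2 = np.concatenate(([0.0], np.cumsum(r * r)))   # prefix sums of squared rewards

    best_var, best_i = -np.inf, 0
    for i in range(m + 1):                             # i rollouts taken from the top
        lo = m - i                                     # the rest taken from the bottom
        s = pre[lo] + (pre[n] - pre[n - i])            # subset reward sum
        s2 = pre2[lo] + (pre2[n] - pre2[n - i])        # subset sum of squares
        var = s2 / m - (s / m) ** 2                    # population variance of subset
        if var > best_var:
            best_var, best_i = var, i

    # Indices (into the original rollout list) of the selected subset.
    return np.concatenate((order[: m - best_i], order[n - best_i:]))
```

In a training step, one would generate n rollouts per prompt in parallel, call this function to keep m of them, and run the expensive GRPO update only on the kept subset; the values of n and m and any surrounding function names are illustrative, not taken from the paper.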

Takeaways, Limitations

Takeaways:
  • Significantly reduces policy-update cost, improving the training efficiency of reinforcement-learning-based language models.
  • Maintains learning quality by selecting an effective rollout subset via max-variance down-sampling.
  • Experiments demonstrate superior performance over standard GRPO across various benchmarks and hardware configurations.
Limitations:
  • Evaluation focuses on specific benchmarks and hardware environments; further research is needed to establish generalizability.
  • The subset-selection criterion leaves room for further optimization and performance improvement.
  • Whether PODS transfers to other reinforcement learning algorithms and model architectures remains to be studied.