This page collects papers related to artificial intelligence published around the world. Summaries are generated with Google Gemini, and the page is operated on a non-profit basis. Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.
Exploring Reinforcement Learning with Verifiable Reward (RLVR): Expanding Depth and Breadth
Outline
This paper explores two underexplored dimensions for improving the reasoning capability of large language models under Reinforcement Learning with Verifiable Rewards (RLVR): depth (the hardest problems from which the model can sample correct solutions) and breadth (the number of instances used in a single training iteration). By analyzing a bias in the GRPO algorithm, the authors propose Difficulty-Adaptive Rollout Sampling (DARS) to address the depth issue, and expand breadth by increasing the batch size and performing full-batch updates. DARS-B expands depth and breadth simultaneously, improving both Pass@K and Pass@1 performance.
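The depth issue can be illustrated with a small sketch. Under GRPO's group-normalized advantages, a problem where every rollout fails (all rewards identical) contributes zero gradient, so hard problems are effectively ignored. A difficulty-adaptive scheme in the spirit of DARS can counter this by allocating more rollouts to low-pass-rate problems until a positive sample becomes likely. The function names and the budget formula below are illustrative assumptions, not the paper's exact method:

```python
import math

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / std.
    When every rollout in the group receives the same reward (e.g. all
    fail on a hard problem), std is zero and all advantages are zero,
    so the problem contributes no learning signal."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def adaptive_rollout_budget(pass_rate, base_n=8, max_n=64, target=0.99):
    """Hypothetical difficulty-adaptive budget: pick the smallest n such
    that P(at least one correct rollout) = 1 - (1 - p)^n >= target,
    clamped to [base_n, max_n]. Easy problems keep the base budget;
    hard problems get more rollouts."""
    if pass_rate <= 0.0:
        return max_n          # never solved so far: spend the full budget
    if pass_rate >= 1.0:
        return base_n         # always solved: no extra rollouts needed
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))
    return min(max(base_n, n), max_n)
```

For example, a problem with an estimated 5% pass rate would receive the maximum budget, while a 50% pass-rate problem keeps the base budget; this is one simple way to concentrate sampling where positive rollouts are rare.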
Takeaways, Limitations
•
Takeaways:
◦
DARS increases the number of positive rollouts for difficult problems, addressing the depth dimension and improving Pass@K performance.
◦
Expanding breadth through large-batch training significantly improves Pass@1 performance.
◦
DARS-B extends depth and breadth simultaneously, improving both Pass@K and Pass@1 performance.
◦
The results demonstrate that depth and breadth are independent dimensions, each contributing to the reasoning gains of RLVR.
•
Limitations:
◦
The paper does not explicitly state its limitations.