Reinforcement Learning with Verifiable Reward: Depth and Breadth
Outline
This paper targets two under-explored dimensions of Reinforcement Learning with Verifiable Reward (RLVR): depth, i.e., how difficult the problems sampled for rollouts are, and breadth, i.e., how many instances are used in a single training iteration. The authors analyze a bias in the GRPO algorithm that causes it to neglect difficult problems, and propose Difficulty Adaptive Rollout Sampling (DARS) to counteract this depth-ignoring behavior. They further expand the breadth of the training data to obtain additional gains. DARS-B, which combines DARS with enlarged breadth, improves Pass@K and Pass@1 simultaneously.
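To make the depth bias concrete, the sketch below shows why GRPO's group-normalized advantage yields no learning signal on problems where every rollout fails, and how a difficulty-adaptive budget could counteract that. The GRPO advantage formula is standard, but the allocation rule, and the names `adaptive_rollout_budget`, `pass_rates`, `base_n`, and `max_n`, are illustrative assumptions, not the paper's exact DARS schedule.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize each prompt's
    rollout rewards by the group mean and standard deviation.
    If every rollout for a prompt earns the same reward (e.g. all fail
    on a hard problem), all advantages are zero and the prompt
    contributes no gradient -- the depth bias discussed above."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def adaptive_rollout_budget(pass_rates, base_n=8, max_n=64):
    """Hypothetical difficulty-adaptive allocation: give prompts with a
    low empirical pass rate a larger rollout budget, so hard prompts are
    more likely to produce at least one correct sample and hence a
    nonzero advantage. This is a sketch of the idea behind DARS, not
    the paper's actual sampling schedule."""
    budgets = []
    for p in pass_rates:
        # Lower pass rate -> larger budget, capped at max_n.
        n = int(base_n / max(p, base_n / max_n))
        budgets.append(min(max(n, base_n), max_n))
    return budgets

# An easy prompt (pass rate 0.9) keeps the base budget of 8 rollouts,
# while a hard prompt (pass rate 0.05) is allotted the 64-rollout cap.
print(adaptive_rollout_budget([0.9, 0.5, 0.05]))   # [8, 16, 64]
print(grpo_advantages([0, 0, 0, 0]))               # all zeros: no signal
```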
Takeaways, Limitations
•
Takeaways:
◦
DARS improves performance by adaptively sampling difficult problems.
◦
Expanding the breadth of the training data improves reasoning capability.
◦
Depth (via DARS) and breadth are two independent factors, each important for improving the reasoning ability of RLVR-trained models.
◦
DARS-B improves both Pass@K and Pass@1.
•
Limitations:
◦
The findings rest on a bias analysis of the GRPO algorithm in particular.
◦
Specific implementation details of DARS and the breadth extension may be limited.
◦
The results may be specific to the algorithm and problems studied; further research is needed to establish generalizability.