Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Created by
  • Haebom

Author

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Reinforcement Learning with Verifiable Reward: Depth and Breadth

Outline

This paper focuses on two underexplored dimensions of Reinforcement Learning with Verifiable Reward (RLVR): depth (sampling difficult problems) and breadth (the number of instances used in a single iteration). The authors analyze a bias in the GRPO algorithm that causes it to neglect hard problems, and propose Difficulty Adaptive Rollout Sampling (DARS) to address this depth-ignoring problem. They further show that expanding the breadth of the training data yields additional gains. DARS-B, which combines DARS with expanded breadth, improves both Pass@K and Pass@1 simultaneously.
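
To make the depth-ignoring problem concrete, here is a minimal sketch of GRPO's group-normalized advantage under binary verifiable rewards. This is an illustrative reconstruction, not code from the paper; the function name and the epsilon stabilizer are assumptions. When every rollout for a hard prompt fails, all advantages collapse to zero and the prompt contributes no gradient:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO standardizes each rollout's reward within its group:
    # A_i = (r_i - mean(r)) / (std(r) + eps)
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes on a moderately hard prompt: nonzero advantages,
# so the prompt still provides a learning signal.
print(grpo_advantages([1, 0, 0, 1]))

# A very hard prompt where every rollout fails: all advantages are 0,
# so the prompt drops out of the gradient -- the depth-ignoring bias.
print(grpo_advantages([0, 0, 0, 0]))
```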

Takeaways, Limitations

Takeaways:
  • DARS improves performance by directing more rollout sampling toward difficult problems (see the sketch after this list).
  • Expanding the breadth of the training data improves reasoning ability.
  • DARS (depth) and breadth are two independent factors, each important for improving RLVR reasoning ability.
  • DARS-B improves both Pass@K and Pass@1.
Limitations:
  • The analysis is based on the bias of a single algorithm, GRPO.
  • The paper may provide only limited implementation details for DARS and the breadth expansion.
  • The results may be specific to the tested algorithm and problems; further research is needed to establish generalizability.
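
As referenced in the takeaways above, the sketch below illustrates one plausible form of difficulty-adaptive rollout allocation in the spirit of DARS. It is a hypothetical reconstruction, not the paper's method: estimated_difficulty, the pass-rate inputs, and the linear budget formula are all illustrative assumptions.

```python
def estimated_difficulty(pass_rate: float) -> float:
    # Hypothetical difficulty score: 1.0 for never-solved prompts,
    # 0.0 for always-solved ones.
    return 1.0 - pass_rate

def adaptive_rollout_budget(pass_rates, base_rollouts=8, max_extra=24):
    # Illustrative difficulty-adaptive allocation: harder prompts get
    # extra rollouts, raising the chance that at least one rollout earns
    # a positive verifiable reward and yields a nonzero GRPO advantage.
    return {
        prompt: base_rollouts + int(max_extra * estimated_difficulty(p))
        for prompt, p in pass_rates.items()
    }

# Toy pass rates measured in earlier training iterations.
pass_rates = {"easy_q": 0.9, "medium_q": 0.4, "hard_q": 0.05}
print(adaptive_rollout_budget(pass_rates))
# -> {'easy_q': 10, 'medium_q': 22, 'hard_q': 30}
```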