Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Created by
  • Haebom

Authors

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Outline

This paper analyzes depth and breadth, two key factors for improving the reasoning performance of language models under reinforcement learning with verifiable rewards (RLVR). We point out a limitation of the existing GRPO algorithm: it overweights medium-accuracy samples and underweights low-accuracy samples, which are crucial for improving reasoning. To address this, we propose Difficulty Adaptive Rollout Sampling (DARS), a technique that rebalances these weights through multi-stage rollouts on difficult problems. We further expand the breadth of training by significantly increasing the batch size and using full-batch updates across multiple epochs instead of PPO-style mini-batch iterations. Finally, we propose DARS-B, which combines DARS with large batch sizes, and experimentally demonstrate that depth and breadth contribute independently to reasoning gains in RLVR.
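To make the depth mechanism concrete, here is a minimal sketch of difficulty-adaptive, multi-stage rollout allocation in the spirit of DARS. All names (generate_rollouts, stage_n, hard_threshold) and the specific staging rule are illustrative assumptions rather than the paper's exact procedure, and a problem's difficulty is simulated with a fixed success probability instead of real policy decoding and verification.

```python
import random

def generate_rollouts(problem, n):
    """Hypothetical stand-in for sampling n solutions and verifying each one.

    Difficulty is simulated with a fixed per-problem success probability;
    a real setup would decode from the policy and run a verifier."""
    return [1 if random.random() < problem["pass_prob"] else 0 for _ in range(n)]

def dars_rollouts(problems, base_n=8, stage_n=8, max_stages=3, hard_threshold=0.25):
    """Multi-stage, difficulty-adaptive rollout allocation (DARS-style sketch).

    Stage 1 probes every problem with base_n rollouts. Problems whose
    estimated pass rate stays at or below hard_threshold receive further
    stages of rollouts, so low-accuracy problems end up contributing more
    positive samples (and hence more gradient weight) to the update."""
    batch = []
    for prob in problems:
        outcomes = generate_rollouts(prob, base_n)
        stages = 0
        while stages < max_stages and sum(outcomes) / len(outcomes) <= hard_threshold:
            outcomes += generate_rollouts(prob, stage_n)
            stages += 1
        batch.append({
            "id": prob["id"],
            "n_rollouts": len(outcomes),
            "pass_rate": round(sum(outcomes) / len(outcomes), 3),
        })
    return batch

if __name__ == "__main__":
    random.seed(0)
    problems = [{"id": i, "pass_prob": p} for i, p in enumerate([0.9, 0.5, 0.1])]
    for entry in dars_rollouts(problems):
        print(entry)  # easy problems stop after the probe; the hard one gets extra stages
```

The design point is that easy problems stop after the probe stage while hard problems accumulate rollouts across stages, shifting sampling effort toward low-accuracy problems. The breadth axis is orthogonal to this sketch: it amounts to running such collection over a much larger problem batch and applying full-batch updates across epochs rather than PPO-style mini-batch iterations.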

Takeaways, Limitations

Takeaways:
The importance of both depth and breadth for RLVR training with GRPO is revealed.
DARS improves reasoning performance in RLVR by increasing exploration of difficult problems.
Expanding breadth with larger batch sizes yields additional performance gains.
DARS-B improves depth and breadth simultaneously, raising both Pass@K and Pass@1.
Experiments demonstrate that depth and breadth operate independently in RLVR.
Limitations:
The effectiveness of the proposed method may be limited to specific RLVR settings and datasets.
Increased computational cost due to using large batch sizes.
Additional experiments on more diverse problem types and datasets are needed.