Daily Arxiv

This page collects summaries of artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Created by
  • Haebom

Author

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang

Exploring Reinforcement Learning with Verifiable Rewards (RLVR): Expanding Depth and Breadth

Outline

This paper explores two underexplored dimensions for improving the reasoning capability of large language models under Reinforcement Learning with Verifiable Rewards (RLVR): depth (the most difficult problems from which the model can sample correct solutions) and breadth (the number of instances used in a single training iteration). After analyzing a bias in the GRPO algorithm, the authors propose Difficulty Adaptive Rollout Sampling (DARS), which allocates additional rollouts to hard problems to address the depth dimension; to expand breadth, they enlarge the batch size and perform full-batch updates. DARS-B expands depth and breadth simultaneously, improving both Pass@K and Pass@1 performance.
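This summary does not reproduce the paper's exact DARS procedure, but the core idea of difficulty-adaptive rollout budgeting can be sketched as follows. In this minimal illustration, the function allocate_rollouts, its parameters base_n and max_n, and the linear difficulty schedule are assumptions made for exposition, not the paper's method: prompts with lower empirical success rates receive a larger share of the rollout budget, so hard problems still contribute positive rollouts to the policy update.

```python
import numpy as np

# Illustrative sketch of difficulty-adaptive rollout budgeting, NOT the
# paper's exact DARS procedure: prompts with lower empirical success
# rates receive more rollouts, so hard problems still yield positive
# (correct) samples for the policy update.
def allocate_rollouts(success_rates, base_n=8, max_n=64):
    difficulty = 1.0 - np.asarray(success_rates, dtype=float)  # higher = harder
    # Linear schedule (an assumption): easy prompts keep the base budget,
    # the hardest prompts receive the full budget.
    budget = base_n + (max_n - base_n) * difficulty
    return np.clip(np.round(budget).astype(int), base_n, max_n)

# Example: per-prompt success rates estimated from a small probe pass.
print(allocate_rollouts([0.9, 0.5, 0.0]))  # -> [14 36 64]
```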

Takeaways, Limitations

Takeaways:
  • Through DARS, the number of positive rollouts for difficult problems increases, addressing the depth dimension and improving Pass@K performance.
  • Expanding breadth via large-batch training significantly improves Pass@1 performance.
  • DARS-B extends depth and breadth simultaneously, improving both Pass@K and Pass@1 performance (a Pass@K estimator is sketched after this list).
  • Depth and breadth are shown to be independent dimensions that each contribute to improving the reasoning capability of RLVR.
Limitations:
  • No specific limitations are stated in the paper.
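As background on the metrics above: Pass@K is commonly computed with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k) for n rollouts of which c are correct; whether the paper uses this exact estimator is not stated in this summary. A minimal sketch:

```python
from math import comb

# Standard unbiased Pass@K estimator: with n rollouts per problem, of
# which c are correct, the probability that at least one of k sampled
# rollouts is correct is 1 - C(n-c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer than k incorrect rollouts -> a pass is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 rollouts, 3 correct, evaluated at k = 8.
print(round(pass_at_k(64, 3, 8), 3))  # 0.335
```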