Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Created by
  • Haebom

Authors

Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li

Outline

This paper notes that Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the complex reasoning capabilities of large language models (LLMs), but that its inherently on-policy strategy, combined with the LLM's vast action space and sparse rewards, prevents it from pushing past the base model's inherent limits. Moreover, RLVR can cause the LLM's capability boundary to collapse, narrowing its problem-solving scope. To address this, the paper proposes RL-PLUS, a novel hybrid-policy optimization approach that synergistically combines internal exploitation with external data to achieve stronger reasoning and overcome the limitations of the base model. RL-PLUS integrates two key components: multiple importance sampling, which corrects the distributional mismatch introduced by external data, and an exploration-based advantage function, which guides the model toward high-value, unexplored reasoning paths. Theoretical analysis and extensive experiments demonstrate the superiority and generalizability of the approach.
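
To make the mechanism more concrete, below is a minimal PyTorch sketch of how hybrid-policy optimization with multiple importance sampling and an exploration-based advantage might look. This is not the authors' implementation: the balance-heuristic mixture weight, the log-probability novelty bonus, the clipping, and all names and hyperparameters (multi_importance_weight, exploration_bonus_advantage, rl_plus_style_loss, beta, clip_max) are illustrative assumptions.

```python
import math
import torch

def multi_importance_weight(logp_current, logp_behaviors, clip_max=10.0):
    # Multiple-importance-sampling weight for tokens drawn from a mixture of
    # behaviour policies (the LLM's own rollouts plus external data).
    # Balance heuristic (assumed): w = pi_current / mean_k(pi_behavior_k).
    # Inputs are token log-probabilities of shape (batch, seq_len).
    log_mix = torch.logsumexp(torch.stack(logp_behaviors, dim=0), dim=0) - math.log(len(logp_behaviors))
    ratio = torch.exp(logp_current - log_mix)
    return ratio.clamp(max=clip_max)  # clip to keep the estimator's variance bounded

def exploration_bonus_advantage(base_advantage, logp_current, beta=0.1):
    # Assumed form of an exploration-based advantage: add a bonus for tokens
    # the current policy still finds unlikely, steering updates toward
    # high-value but under-explored reasoning paths.
    novelty = -logp_current.detach()  # rarer tokens => larger bonus
    return base_advantage + beta * novelty

def rl_plus_style_loss(logp_current, logp_behaviors, base_advantage):
    # Importance-weighted policy-gradient surrogate over mixed
    # on-policy / external data; gradients flow through the weight.
    w = multi_importance_weight(logp_current, logp_behaviors)
    adv = exploration_bonus_advantage(base_advantage, logp_current)
    return -(w * adv).mean()

# Toy usage with random stand-ins for token log-probabilities (batch=2, seq_len=4).
logp_cur = torch.randn(2, 4, requires_grad=True)
logp_behaviors = [torch.randn(2, 4), torch.randn(2, 4)]  # on-policy rollouts + external data
advantage = torch.randn(2, 4)                            # e.g. verifiable-reward advantage
loss = rl_plus_style_loss(logp_cur, logp_behaviors, advantage)
loss.backward()
```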

Takeaways, Limitations

Takeaways:
RL-PLUS achieves state-of-the-art performance on six mathematical reasoning benchmarks, outperforming existing RLVR methods.
It also shows strong performance on six out-of-distribution reasoning tasks.
It delivers consistent and significant gains across diverse model families, with average relative improvements of up to 69.2%.
RL-PLUS effectively resolves the capability boundary collapse problem.
Limitations:
The paper does not explicitly state the limitations of RL-PLUS, so further research is needed to identify them. For example, deeper analysis of the effectiveness of multiple importance sampling and the exploration-based advantage function may be necessary, and generalizability to certain problem types or LLM architectures may be limited.