Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A Survey of Reinforcement Learning for Large Reasoning Models

Created by
  • Haebom

Author

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou

Outline

This paper surveys recent advances in reinforcement learning (RL) for improving the reasoning capabilities of large language models (LLMs). RL has achieved remarkable success in strengthening LLM performance, particularly on complex logical tasks such as mathematics and coding, and has become a foundational methodology for turning LLMs into large reasoning models (LRMs). However, despite rapid progress, scaling RL to LLMs and LRMs faces fundamental challenges not only in computational resources but also in algorithm design, training data, and infrastructure. It is therefore timely to revisit the field's progress, reexamine its trajectory, and explore strategies for increasing the scalability of RL toward artificial superintelligence (ASI). Specifically, we examine research applying RL to LLMs and LRMs for reasoning since the release of DeepSeek-R1, covering foundational components, core challenges, training resources, and downstream applications, in order to identify future opportunities and directions in this rapidly evolving field. We hope this survey will stimulate future research on RL for a broader range of reasoning models.
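To make the RL-for-reasoning recipe mentioned above concrete: DeepSeek-R1's training popularized GRPO, where a group of responses is sampled per prompt and each response's advantage is computed relative to the group's reward statistics, avoiding a learned value model. The sketch below shows only that group-relative advantage step in simplified form (it omits the policy-ratio clipping and KL penalty of the full objective, and the function name is illustrative, not from any library):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages (simplified): each sampled response in a
    group is scored relative to the group's mean reward, normalized by
    the group's standard deviation. Omits clipping and KL terms."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std for simplicity
    if std == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one math prompt, scored 1/0 by a
# rule-based verifier (correct final answer or not).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)  # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline comes from the group itself, this scales to verifiable-reward settings (math, coding) without training a separate critic, which is part of why it features prominently in post-R1 work.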

Takeaways, Limitations

Takeaways: The survey demonstrates the utility of RL in improving the reasoning capabilities of LLMs and underscores its importance as a foundational methodology for LRM development. It analyzes research trends since DeepSeek-R1 and suggests future research directions, which should help stimulate RL research for a broader range of reasoning models.
Limitations: The survey may lack a thorough analysis of RL's scalability limits (computational resources, algorithm design, training data, infrastructure). It may not present a concrete roadmap for applying RL toward ASI. Its emphasis on a general overview rather than in-depth treatment of specific algorithms or models may leave out fine-grained technical detail.