Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Outline
This paper introduces Group Sequence Policy Optimization (GSPO), a stable, efficient, and high-performing reinforcement learning algorithm for training large language models. Unlike existing algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and shows potential to simplify RL infrastructure design. These advantages contributed to the remarkable performance improvements of the state-of-the-art Qwen3 models.
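Based on this description, the snippet below is a minimal sketch of what a sequence-level importance ratio with sequence-level clipping could look like. It is not the paper's reference implementation: the function name `gspo_loss`, the length normalization of the ratio, and the GRPO-style group-normalized advantage are illustrative assumptions.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, seq_lens, clip_eps=0.2):
    """Minimal sketch of a sequence-level clipped objective (hypothetical helper).

    logp_new, logp_old: (G,) summed log-likelihoods of each sampled response
        under the current and the old (behavior) policy.
    rewards:  (G,) scalar rewards for the G responses sampled for one query.
    seq_lens: (G,) response lengths, used here for length normalization
        (an assumed detail; the summary above does not specify it).
    """
    # GRPO-style group-normalized advantage (assumed, since GSPO is compared to GRPO).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level importance ratio computed from whole-sequence likelihoods,
    # rather than from per-token ratios.
    ratio = torch.exp((logp_new - logp_old) / seq_lens.float())

    # Clipping and optimization happen at the sequence level.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

For example, with a group of G = 4 responses per query, each input tensor has shape (4,), and the loss is averaged over the group before backpropagation.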
Takeaways, Limitations
• Takeaways:
  ◦ Presents GSPO, a stable and efficient reinforcement learning algorithm for large-scale language model training.
  ◦ Achieves improved training efficiency and performance over the GRPO algorithm through sequence-level optimization.
  ◦ Contributes to stabilizing MoE RL training.
  ◦ Suggests the possibility of simplifying RL infrastructure design.
  ◦ Contributes to the performance improvements of the latest Qwen3 models.
• Limitations:
  ◦ The paper does not explicitly discuss the limitations of the GSPO algorithm; further experiments and analysis are needed to clarify them.