Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Group Sequence Policy Optimization

Created by
  • Haebom

Author

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Outline

This paper introduces Group Sequence Policy Optimization (GSPO), a stable, efficient, and high-performance reinforcement learning algorithm for training large-scale language models. Unlike existing algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs clipping, rewarding, and optimization at the sequence level. GSPO achieves superior training efficiency and performance compared to the GRPO algorithm; in particular, it stabilizes Mixture-of-Experts (MoE) RL training and shows potential to simplify RL infrastructure design. These advantages of GSPO contributed to the remarkable performance improvements of the state-of-the-art Qwen3 models.
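For readers who want a concrete picture of what "sequence-level importance ratio and clipping" means, below is a minimal PyTorch-style sketch, not the authors' code. The function name, tensor shapes, and the clip_eps default are illustrative assumptions; only the length-normalized sequence likelihood ratio and the sequence-level (rather than token-level) clipping follow the paper's description.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, seq_lens, clip_eps=0.2):
    """
    Sketch of a GSPO-style clipped objective for a group of G sampled responses.

    logp_new:   (G,) summed token log-probs of each response under the current policy
    logp_old:   (G,) summed token log-probs under the old (rollout) policy
    advantages: (G,) group-normalized advantages, e.g. (r - mean(r)) / std(r)
    seq_lens:   (G,) number of tokens in each response
    """
    # Sequence-level importance ratio, length-normalized:
    # s_i = (pi_new(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
    ratio = torch.exp((logp_new - logp_old) / seq_lens)

    # Clip the whole-sequence ratio (PPO-style), instead of clipping per token.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # The objective is maximized, so return its negative as a loss.
    return -torch.min(unclipped, clipped).mean()
```

A typical usage would compute advantages by normalizing the rewards within each group of responses to the same prompt (as in GRPO) and then apply the loss above per group.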

Takeaways, Limitations

Takeaways:
  • Presents GSPO, a stable and efficient reinforcement learning algorithm for large-scale language model training.
  • Achieves improved training efficiency and performance over the GRPO algorithm through sequence-level optimization.
  • Contributes to stabilizing MoE RL training.
  • Suggests the possibility of simplifying RL infrastructure design.
  • Contributes to the performance improvements of the latest Qwen3 models.
Limitations:
  • The paper does not explicitly discuss GSPO's limitations; further experiments and analysis are needed to identify them.