Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Single-stream Policy Optimization

Created by
  • Haebom

Author

Zhongwen Xu, Zihan Ding

Outline

This paper revisits policy-gradient optimization for large language models (LLMs) from a single-stream perspective. Existing group-based methods such as GRPO reduce variance with on-the-fly group baselines, but they suffer from serious drawbacks: frequent degenerate groups, in which every response receives the same reward and thus yields zero advantage, waste the training signal, and synchronization barriers limit scalability. The paper proposes Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance training signal for every sample. By removing grouping, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling.

Experiments with Qwen3-8B show that SPO converges more smoothly and reaches higher accuracy than GRPO while eliminating the computation wasted on degenerate groups. Further analysis confirms that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. On five challenging mathematical benchmarks with Qwen3-8B, SPO improves average maj@32 by +3.4 percentage points over GRPO, with notable absolute gains on hard datasets such as BRUMO 25 (+7.3 percentage points), AIME 25 (+4.4 percentage points), and HMMT 25 (+3.3 percentage points), and consistent relative gains in pass@k for all evaluated values of k. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms and highlights a path where fundamental principles, rather than architectural workarounds, drive the next advance in LLM reasoning.
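As a rough illustration of the mechanism described above, the sketch below shows how a per-prompt baseline tracker and batch-wide advantage normalization might fit together. It is a minimal sketch based on the summary only: the names KLAdaptiveValueTracker and spo_advantages, the exponential step-size rule, and the 0.5 prior are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class KLAdaptiveValueTracker:
    """Hypothetical per-prompt baseline tracker.

    Keeps a running estimate of the expected reward for each prompt and
    forgets old evidence faster when the policy has drifted (large KL),
    so the baseline follows the current policy rather than a stale one.
    """

    def __init__(self, beta: float = 1.0):
        self.values = {}   # prompt_id -> running value estimate
        self.beta = beta   # how strongly KL drift speeds up updates (assumed form)

    def update(self, prompt_id: str, reward: float, kl_to_behavior: float) -> float:
        v = self.values.get(prompt_id, 0.5)  # neutral prior, assuming rewards in [0, 1]
        # Assumed rule: step size grows with KL drift, so newer samples count more.
        alpha = 1.0 - np.exp(-self.beta * (kl_to_behavior + 1e-3))
        self.values[prompt_id] = (1.0 - alpha) * v + alpha * reward
        return v  # baseline *before* the update, used for this sample's advantage


def spo_advantages(rewards, baselines, eps: float = 1e-6):
    """Advantage = reward - tracked baseline, normalized across the whole batch
    (instead of within a per-prompt group as in GRPO)."""
    adv = np.asarray(rewards, dtype=np.float64) - np.asarray(baselines, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)


# Usage: one sampled response per prompt (single stream), no grouping required.
tracker = KLAdaptiveValueTracker(beta=2.0)
batch = [("p1", 1.0, 0.05), ("p2", 0.0, 0.20), ("p3", 1.0, 0.01)]  # (prompt, reward, KL)
baselines = [tracker.update(pid, r, kl) for pid, r, kl in batch]
advantages = spo_advantages([r for _, r, _ in batch], baselines)
print(advantages)
```

Because the baseline persists across batches, each prompt needs only a single response per step, which is what removes GRPO's grouping and its synchronization barrier.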

Takeaways, Limitations

Takeaways:
Single-stream Policy Optimization (SPO) removes the degenerate-group problem and the synchronization barriers of group-based methods such as GRPO, enabling higher throughput and better scalability.
The KL-adaptive value tracker and global advantage normalization provide a stable, low-variance training signal, yielding smoother convergence and higher accuracy.
The persistent value tracker naturally supports an adaptive curriculum via prioritized sampling (see the sketch after this list).
SPO outperforms GRPO on several challenging mathematical benchmarks.
The work makes a case for reducing incidental complexity in RL algorithms and focusing on fundamental principles.
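One possible reading of the adaptive-curriculum point, as a minimal sketch: prompts whose tracked value sits near the middle of the reward range are neither trivially solved nor hopeless, so they can be sampled more often. The priority function v*(1-v) and the name curriculum_sample are illustrative assumptions, not necessarily the paper's rule.

```python
import numpy as np

def curriculum_sample(prompt_values: dict, batch_size: int, rng=None):
    """Hypothetical prioritized sampler over tracked prompt values.

    Prompts with values near 0.5 (assuming rewards in [0, 1]) carry the most
    learning signal, so they get higher sampling probability.
    """
    rng = rng or np.random.default_rng()
    ids = list(prompt_values)
    v = np.array([prompt_values[i] for i in ids])
    priority = v * (1.0 - v) + 1e-3          # small floor keeps every prompt reachable
    probs = priority / priority.sum()
    return rng.choice(ids, size=batch_size, replace=False, p=probs).tolist()

# Usage with the tracker's current estimates (values here are made up).
print(curriculum_sample({"p1": 0.9, "p2": 0.5, "p3": 0.1, "p4": 0.45}, batch_size=2))
```
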
Limitations:
Results are reported only for the Qwen3-8B model; generalization to other models and tasks requires further study.
Details on the hyperparameter settings and tuning of the KL-adaptive value tracker are limited.
Further validation is needed to confirm that the benefits of the single-stream approach carry over to all types of LLMs and tasks.