This paper revisits policy-gradient optimization for large language models (LLMs) from a single-stream perspective. Prevailing group-based methods such as GRPO reduce variance with on-the-fly baselines, but suffer from critical flaws: frequent degenerate groups erase the learning signal, and synchronization barriers limit scalability. We propose Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings. Moreover, the persistent value tracker naturally enables adaptive curricula via prioritized sampling. Experiments with Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating the computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. On five challenging mathematical benchmarks with Qwen3-8B, SPO improves average maj@32 by +3.4 percentage points over GRPO, with substantial absolute gains on hard datasets such as BRUMO 25 (+7.3 points), AIME 25 (+4.4 points), and HMMT 25 (+3.3 points), and consistent relative gains in pass@$k$ across all evaluated values of $k$. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, rather than architectural workarounds, drive the next advances in LLM reasoning.
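
To make the core mechanism concrete, the sketch below illustrates the two ingredients named above: a per-prompt baseline maintained by a value tracker whose update rate grows with an estimate of policy drift (standing in for the KL-adaptive rule), and advantage normalization computed globally over the batch rather than within per-prompt groups. This is an illustrative assumption-laden sketch, not the authors' implementation; names such as `ValueTracker`, the EMA update, and the `0.1 + kl_drift` schedule are invented for exposition.

```python
# Minimal sketch (assumptions, not the paper's code): KL-adaptive per-prompt
# baseline tracking + global advantage normalization for single-stream updates.
from collections import defaultdict
import numpy as np

class ValueTracker:
    def __init__(self, init_value=0.5):
        # One scalar baseline estimate per prompt, persisted across training steps.
        self.values = defaultdict(lambda: init_value)

    def update(self, prompt_id, reward, kl_drift):
        # Assumed schedule: the more the policy has drifted since the last update,
        # the less we trust the stale estimate, so the larger the step size.
        rate = min(1.0, 0.1 + kl_drift)
        baseline = self.values[prompt_id]
        self.values[prompt_id] = (1 - rate) * baseline + rate * reward
        return baseline  # baseline used for the current sample (pre-update)

def global_advantages(rewards, baselines, eps=1e-8):
    # Advantage = reward minus per-prompt baseline, normalized across the whole
    # batch (instead of within a group of responses to one prompt, as in GRPO).
    adv = np.asarray(rewards, dtype=float) - np.asarray(baselines, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)

# Usage: one sampled response per prompt ("single stream"), no grouping required.
tracker = ValueTracker()
samples = [("p1", 1.0, 0.02), ("p2", 0.0, 0.05), ("p3", 1.0, 0.01)]
baselines = [tracker.update(pid, r, kl) for pid, r, kl in samples]
advantages = global_advantages([r for _, r, _ in samples], baselines)
```

Because every sample contributes its own baseline-corrected, globally normalized advantage, no rollout is discarded when all responses to a prompt share the same reward, which is the degenerate-group failure mode of group-based baselines described above.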