This paper presents a method for improving the reasoning capability of large language models (LLMs) using reinforcement learning (RL). Existing Group Relative Policy Optimization (GRPO) methods cannot update the policy for all-negative-sample groups, i.e., groups in which every sampled response is incorrect. To address this limitation, we propose Stepwise Guided Policy Optimization (SGPO), a simple framework that uses a stepwise judge model to increase the diversity of responses within each group. The judge model can be trained directly or instantiated from existing LLMs, and we show theoretically that it accelerates GRPO learning in a simplified setting. Experiments on nine benchmarks show that SGPO outperforms GRPO in both offline and online training for models of various sizes (7B, 14B, and 32B), including base and distilled versions. The gains are especially pronounced in the early and intermediate stages of training, when all-negative-sample groups are frequent. Furthermore, SGPO differs from knowledge distillation methods in that the judge model is not required to generate the correct answer.
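To make the all-negative-sample limitation concrete, the sketch below (not taken from the paper; the function name and binary correctness rewards are illustrative assumptions) shows how GRPO's group-normalized advantages vanish when every response in a group receives the same reward, so such groups contribute no policy gradient. This is the failure mode that SGPO's stepwise guidance is intended to break by diversifying the responses within the group.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style:
    each reward is normalized by the group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A mixed group (some correct, some incorrect responses) yields nonzero
# advantages, so the policy gradient carries a learning signal.
print(grpo_advantages([1, 0, 0, 1]))  # approx. [ 1., -1., -1.,  1.]

# An all-negative-sample group: every response is wrong, all rewards are
# identical, and every advantage collapses to zero, so this group does
# not update the policy. A judge-guided response that earns a positive
# reward would restore the signal (the mechanism SGPO builds on).
print(grpo_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.]
```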