Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Created by
  • Haebom

Authors

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

Outline

This paper presents a method for improving the reasoning capability of large language models (LLMs) with reinforcement learning (RL). Group Relative Policy Optimization (GRPO) has a known limitation: groups in which every sampled response is incorrect (all-negative-sample groups) produce no policy update. To address this, the authors propose Stepwise Guided Policy Optimization (SGPO), a simple framework that uses a step-wise judge model to increase the diversity of responses within a group. The judge model can be trained directly or built from existing LLMs, and the paper shows theoretically that it accelerates GRPO learning in a simplified setting. Experiments on nine benchmarks show that SGPO outperforms GRPO in both offline and online training for models of various sizes (7B, 14B, and 32B), including base and distilled versions. The gains are largest in the early and intermediate training stages, when all-negative-sample groups are most frequent. Furthermore, unlike knowledge distillation methods, SGPO does not require the judge model to generate correct answers itself.
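To make the all-negative-sample issue concrete, the sketch below computes GRPO-style group-relative advantages for a group of sampled responses. It is a minimal illustration, not the authors' implementation: the `group_advantages` helper and the example step-wise scores are assumptions standing in for whatever reward the judge-guided scheme actually produces. The point is only that when every response in a group receives the same zero reward, the normalized advantages are all zero and the group yields no gradient, whereas any within-group variation (here, hypothetical partial step-wise credit) restores a usable training signal.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: each reward is normalized
    against the mean and standard deviation of its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# All-negative group: every sampled response is wrong, the binary outcome
# reward is 0 for all of them, and the advantages vanish; the group
# contributes no policy-gradient signal.
all_negative = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(all_negative))   # [0. 0. 0. 0.]

# Hypothetical step-wise judge scores (e.g., fraction of reasoning steps
# the judge accepts). They differentiate otherwise all-wrong responses,
# so the normalized advantages are nonzero and the group can still
# drive a policy update.
stepwise_scores = [0.25, 0.50, 0.00, 0.75]
print(group_advantages(stepwise_scores))
```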

Takeaways, Limitations

Takeaways:
  • Improves the reasoning ability of RL-trained LLMs by addressing the all-negative-sample group problem.
  • Improves GRPO's training efficiency by using a step-wise judge model.
  • Shows consistent performance gains across LLMs of various sizes.
  • Unlike knowledge distillation methods, it does not require a model that generates correct answers.
Limitations:
  • The theoretical analysis covers only a simplified setting, so it may not fully reflect practical training dynamics.
  • The design and training of the step-wise judge model need further study.
  • Additional experiments on more diverse and complex benchmarks may be needed.