Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

Created by
  • Haebom

Authors

Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua

Outline

In this paper, we propose Self-Guided Process Reward Optimization (SPRO), a framework that addresses two obstacles of process reinforcement learning (PRL), which has shown significant potential for improving the reasoning capability of large language models (LLMs): its high computational cost and the lack of a unified theoretical framework for process-level advantage estimation. SPRO enables process-aware RL through two key innovations: it theoretically proves that process rewards can be derived from the policy model itself, and it introduces well-defined cumulative process rewards together with Masked Step Advantage (MSA), which enables strict step-wise action advantage estimation within groups of responses sampled from a shared prompt. Experimental results show that SPRO achieves 3.4x higher training efficiency and a 17.5% improvement in test accuracy over vanilla GRPO. It also maintains stable and high policy entropy throughout training while reducing the average response length by roughly one third, indicating sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational cost compared with outcome-supervised RL methods such as GRPO, which makes it well suited to industrial implementation.
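To make the group-relative, step-wise advantage idea concrete, here is a minimal sketch. It is not the paper's exact MSA formulation: it assumes per-step scalar process rewards (e.g., derived from the policy model's own log-probabilities), accumulates them into cumulative process rewards, and normalizes each step against the other responses sampled from the same prompt while masking padded steps. The function name masked_step_advantage and the NumPy-based implementation are illustrative assumptions.

```python
# Illustrative sketch of group-relative, masked step-wise advantage estimation.
# This mirrors the idea described in the summary (shared-prompt sampling group,
# cumulative process rewards, per-step masking), not the paper's exact equations.
import numpy as np

def masked_step_advantage(step_rewards: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """
    step_rewards: (G, T) per-step process rewards for G responses to one prompt,
                  padded to a common length T.
    mask:         (G, T) 1.0 for real steps, 0.0 for padding.
    Returns:      (G, T) step-wise advantages, zero on padded positions.
    """
    # Cumulative process reward up to each step of each trajectory.
    cum_rewards = np.cumsum(step_rewards * mask, axis=1)

    # Group baseline: mean and std of cumulative rewards at each step,
    # computed only over trajectories that are still active at that step.
    valid_count = np.clip(mask.sum(axis=0), 1.0, None)           # (T,)
    group_mean = (cum_rewards * mask).sum(axis=0) / valid_count   # (T,)
    group_std = np.sqrt(
        ((cum_rewards - group_mean) ** 2 * mask).sum(axis=0) / valid_count
    ) + 1e-8                                                      # (T,)

    # Group-relative, normalized advantage per step, masked to valid steps only.
    advantage = (cum_rewards - group_mean) / group_std
    return advantage * mask

# Toy usage: 3 responses to the same prompt, padded to 4 steps.
rewards = np.array([[0.1, 0.3, 0.2, 0.0],
                    [0.0, 0.1, 0.0, 0.0],
                    [0.2, 0.2, 0.1, 0.3]])
mask = np.array([[1., 1., 1., 0.],
                 [1., 1., 0., 0.],
                 [1., 1., 1., 1.]])
print(masked_step_advantage(rewards, mask))
```

Because the baseline is computed from the same sampled group that GRPO already uses, a scheme like this adds no extra reward-model forward passes, which is the efficiency argument the summary highlights.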

Takeaways, Limitations

Takeaways:
SPRO is a novel framework that effectively addresses the computational cost problem of process reinforcement learning.
It achieves higher training efficiency and test accuracy than existing methods.
Stable policy entropy and shorter responses indicate efficient exploration and prevention of reward hacking.
Enabling process-aware reinforcement learning without additional computational cost increases industrial applicability.
Limitations:
Further verification of the generalizability of the presented theoretical proofs and experimental results is needed.
The applicability and performance of SPRO need to be evaluated across a wider range of LLM architectures and tasks.
A more detailed description and analysis of the design and parameter settings of MSA is needed.