Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Created by
  • Haebom

Author

Hongze Tan, Jianfei Pan

Outline

This paper explores reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) for improving the reasoning performance of large language models (LLMs). Standard GRPO struggles on long-chain reasoning tasks because of its coarse-grained credit assignment, which applies the same reward to every token in a sequence. To address this, the authors propose a dynamic entropy weighting technique. Building on the core idea that high-entropy tokens in correct responses contribute more to performance, they generate finer-grained reward signals in two ways. First, **Group Token Policy Optimization (GTPO)** assigns an entropy-weighted reward to each token, achieving fine-grained credit assignment. Second, **Sequence-Level Group Relative Policy Optimization (GRPO-S)** assigns an entropy-weighted reward to each sequence based on that sequence's average token entropy. Experiments show that the proposed methods significantly outperform the strong DAPO baseline, confirming that the entropy weighting mechanism is the primary driver of the performance gains.
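To make the two shaping schemes concrete, here is a minimal PyTorch-style sketch of entropy-weighted reward shaping. The function name `entropy_weighted_rewards`, the proportional redistribution used for the token-level case, and the mean-entropy scaling used for the sequence-level case are illustrative assumptions, not the paper's exact formulas; the actual methods also involve GRPO's group-relative advantage normalization, which is omitted here.

```python
import torch

def entropy_weighted_rewards(logits: torch.Tensor,
                             seq_reward: float,
                             token_level: bool = True) -> torch.Tensor:
    """Illustrative entropy-weighted reward shaping (not the paper's exact formulation).

    logits:     [T, V] per-token logits of one sampled response
    seq_reward: scalar outcome reward for the whole response (e.g. 1.0 if correct)
    token_level=True  -> GTPO-style: a shaped reward per token
    token_level=False -> GRPO-S-style: a single shaped reward for the sequence
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_entropy = -(probs * log_probs).sum(dim=-1)  # [T] policy entropy at each step

    if token_level:
        # GTPO-style: redistribute the sequence reward across tokens in proportion
        # to each token's entropy, so high-entropy tokens receive more credit.
        weights = token_entropy / (token_entropy.sum() + 1e-8)
        return seq_reward * weights                   # [T], sums to seq_reward
    else:
        # GRPO-S-style: scale the single sequence reward by the mean token entropy.
        return seq_reward * token_entropy.mean()      # scalar


# Toy usage: 5 generated tokens over a vocabulary of 10
logits = torch.randn(5, 10)
print(entropy_weighted_rewards(logits, seq_reward=1.0, token_level=True))
print(entropy_weighted_rewards(logits, seq_reward=1.0, token_level=False))
```

The sketch only illustrates how entropy can turn a single outcome reward into either per-token or per-sequence signals; in practice these shaped rewards would feed into the policy-gradient update alongside the usual group-relative baseline.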

Takeaways, Limitations

Takeaways:
Presents a novel reinforcement learning technique for improving the long-chain reasoning ability of LLMs.
Shows that fine-grained credit assignment via dynamic entropy weighting can improve performance.
Provides two approaches: GTPO (token-level) and GRPO-S (sequence-level).
Effectiveness is verified by performance gains over the DAPO baseline.
Limitations:
Further research is needed on the generalization performance of the proposed methods.
Further experiments with other LLMs and tasks are needed.
Further research is needed on how to tune the entropy weighting.
Computing entropy-based rewards may increase computational cost.