Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Created by
  • Haebom

Authors

Hongze Tan, Jianfei Pan

Outline

This paper studies reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) for improving the reasoning performance of large language models (LLMs). Standard GRPO assigns the same reward to every token in a response, and this coarse credit assignment limits its effectiveness on long chain-of-thought reasoning tasks. To address this, the paper proposes Dynamic Entropy Weighting. Based on the idea that higher-entropy tokens in correct responses can guide the policy toward a higher performance ceiling, it generates finer-grained reward signals through two methods. First, Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token, enabling fine-grained credit assignment. Second, Sequence-Level Group Relative Policy Optimization (GRPO-S) assigns an entropy-weighted reward to each sequence based on that sequence's average token entropy. Experiments show that the proposed methods significantly outperform the strong DAPO baseline, confirming that the entropy-weighting mechanism is the main driver of the performance gains and suggesting a promising path toward deeper reasoning in models.
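As a rough illustration of the token-level variant, the sketch below shows one way entropy-weighted credit assignment could look in practice. The function names, the softmax-based weighting, and the `alpha` temperature are assumptions made for illustration, not the authors' exact formulation.

```python
# Minimal sketch of token-level entropy-weighted credit assignment in the
# spirit of GTPO. The weighting formula (softmax over token entropies with a
# temperature `alpha`) is an illustrative assumption, not the paper's exact rule.
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy from logits of shape (seq_len, vocab)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)

def gtpo_token_advantages(logits: torch.Tensor,
                          group_advantage: float,
                          alpha: float = 1.0) -> torch.Tensor:
    """
    Redistribute a sequence-level (group-relative) advantage over tokens,
    giving higher-entropy tokens a larger share of the credit.
    """
    ent = token_entropy(logits)                     # (seq_len,)
    weights = torch.softmax(alpha * ent, dim=0)     # sums to 1 over the sequence
    # Rescale so the weights average to 1, keeping the total credit comparable
    # to the uniform per-token assignment used by standard GRPO.
    weights = weights * ent.numel()
    return group_advantage * weights                # (seq_len,)
```

In a GRPO-style loop, one would first compute the usual group-relative advantage for each sampled response and then spread it over that response's tokens with a function like `gtpo_token_advantages`, instead of copying the same scalar to every token.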

Takeaways, Limitations

Takeaways:
  • Dynamic entropy weighting can improve the long chain-of-thought reasoning performance of LLMs.
  • The proposed GTPO and GRPO-S algorithms overcome the coarse credit assignment of standard GRPO and enable finer-grained credit allocation (see the sequence-level sketch after this list).
  • Experiments indicate that the entropy-weighting mechanism plays a crucial role in improving deep reasoning in LLMs.
  • The proposed methods achieve better performance than the DAPO baseline, demonstrating their effectiveness.
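For the sequence-level variant, the following sketch shows one plausible form of GRPO-S-style shaping, in which each sequence's reward is rescaled by its mean token entropy before the usual group-relative normalization. The blending coefficient `beta` and the exact rescaling formula are illustrative assumptions, not the authors' implementation.

```python
# Sketch of sequence-level entropy weighting in the spirit of GRPO-S.
# Assumption: rewards are rescaled by each sequence's mean token entropy
# (relative to the group average) before group-relative normalization.
import torch

def grpo_s_advantages(rewards: torch.Tensor,
                      mean_entropies: torch.Tensor,
                      beta: float = 0.5) -> torch.Tensor:
    """
    rewards:        (group_size,) scalar rewards for each sampled sequence.
    mean_entropies: (group_size,) average token entropy of each sequence.
    Returns one group-relative advantage per sequence.
    """
    # Sequences with above-average entropy get a slightly larger reward weight.
    rel_ent = mean_entropies / (mean_entropies.mean() + 1e-8)  # ~1 on average
    shaped = rewards * (1.0 + beta * (rel_ent - 1.0))
    # Standard GRPO-style normalization: subtract group mean, divide by std.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```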
Limitations:
  • Further research is needed to evaluate how well the proposed methods generalize.
  • More experiments on different types of LLMs and reasoning tasks are needed.
  • Further study may be needed to determine how strongly entropy should weight the rewards.
  • The additional computational cost of computing per-token entropies may need to be considered.