Daily Arxiv

This page organizes artificial intelligence papers published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Created by
  • Haebom

Authors

Hongze Tan, Jianfei Pan, Jinghao Lin, Tao Chen, Zhihang Zheng, Zhihao Tang, Haihua Yang

Outline

Reinforcement learning (RL) plays a crucial role in improving the reasoning performance of large language models (LLMs). However, existing algorithms use a coarse credit assignment scheme that applies the same reward uniformly to every token in a sequence, a critical flaw in long-chain reasoning tasks. To address this, the paper proposes Dynamic Entropy Weighting, a mechanism that enables fine-grained reward shaping through two new algorithms: Group Token Policy Optimization (GTPO) and Sequence-Level GRPO (GRPO-S). The method is based on the hypothesis that high policy entropy along the reasoning path is a strong heuristic for cognitive effort at critical decision points. By incorporating policy entropy into reward shaping, the method achieves genuinely token-level credit assignment. Experiments show that both algorithms outperform the strong DAPO baseline, confirming that the entropy-weighting mechanism is a key driver of the performance gains.
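
To make the shaping idea concrete, the sketch below illustrates one plausible way to combine a GRPO-style group-relative advantage with entropy weighting at the token level (GTPO-like) and at the sequence level (GRPO-S-like). This is a minimal illustration under our own assumptions, not the authors' implementation: the function names, the within-sequence and within-group entropy normalizations, and the epsilon constants are illustrative choices, and the paper's exact formulas may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style scalar advantage per sampled sequence:
    reward normalized against the group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def token_entropy(probs, eps=1e-12):
    """Policy entropy H_t = -sum_v p_t(v) log p_t(v) at each position.
    `probs` has shape (seq_len, vocab_size)."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def gtpo_advantages(rewards, group_probs, eps=1e-8):
    """Token-level shaping (GTPO-like sketch): scale each sequence's
    scalar advantage by that token's entropy, normalized within the
    sequence, so high-entropy 'decision' tokens receive more credit."""
    seq_adv = group_relative_advantages(rewards, eps)
    shaped = []
    for a, probs in zip(seq_adv, group_probs):
        h = token_entropy(probs)
        w = h / (h.mean() + eps)   # mean weight of 1 keeps the overall scale
        shaped.append(a * w)       # per-token advantage
    return shaped

def grpo_s_advantages(rewards, group_probs, eps=1e-8):
    """Sequence-level shaping (GRPO-S-like sketch): weight each sequence's
    scalar advantage by its mean token entropy (normalized across the
    group), then broadcast the same value to every token."""
    seq_adv = group_relative_advantages(rewards, eps)
    mean_h = np.array([token_entropy(p).mean() for p in group_probs])
    w = mean_h / (mean_h.mean() + eps)
    return [a * wi * np.ones(p.shape[0])
            for a, wi, p in zip(seq_adv, w, group_probs)]

# Toy usage: a group of 3 sampled sequences of different lengths,
# with random next-token distributions standing in for the policy.
rng = np.random.default_rng(0)
group_probs = [rng.dirichlet(np.ones(5), size=n) for n in (4, 6, 5)]
rewards = [1.0, 0.0, 0.5]
print(gtpo_advantages(rewards, group_probs))
print(grpo_s_advantages(rewards, group_probs))
```

In both variants the group-relative baseline is the standard GRPO advantage; the entropy weighting only redistributes credit, giving higher-entropy tokens (GTPO) or higher-entropy sequences (GRPO-S) a larger share of it.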

Takeaways, Limitations

Takeaways:
  • Proposes Dynamic Entropy Weighting, a novel mechanism that enables token-level credit assignment in LLM reasoning.
  • Realizes the mechanism through two algorithms, GTPO and GRPO-S.
  • Demonstrates superior performance over the strong DAPO baseline.
  • Presents a novel use of policy entropy as a reward-shaping signal.
Limitations:
  • The paper does not explicitly discuss its limitations.