Reinforcement learning (RL) plays a crucial role in improving the reasoning performance of large language models (LLMs). However, existing algorithms rely on coarse credit assignment, applying the same reward uniformly to every token in a sequence, which is a critical flaw in long chain-of-thought reasoning tasks. To address this issue, this paper proposes Dynamic Entropy Weighting, a novel mechanism that produces fine-grained reward signals through two new algorithms: Group Token Policy Optimization (GTPO) and Sequence-Level GRPO (GRPO-S). The method rests on the hypothesis that high policy entropy along the reasoning path is a strong heuristic for cognitive effort at pivotal junctures. By incorporating policy entropy into reward shaping, we achieve genuinely token-specific credit assignment. Experimental results show that our method outperforms the strong DAPO baseline, confirming that the entropy-weighting mechanism is a key driver of the performance gains.
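
To make the core idea concrete, the sketch below shows one plausible way to redistribute a scalar sequence-level reward across tokens in proportion to per-token policy entropy. It is a minimal illustration only: the function name, the specific normalization, and the PyTorch implementation are our assumptions for exposition and are not taken from the paper's exact formulation of GTPO or GRPO-S.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_token_rewards(logits: torch.Tensor,
                                   sequence_reward: float,
                                   eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch (assumed, not the paper's exact rule):
    split one scalar sequence reward across tokens so that
    higher-entropy tokens receive a larger share of the credit.

    logits: (T, V) per-token logits from the policy for one generated sequence.
    sequence_reward: scalar reward assigned to the whole sequence.
    Returns: (T,) per-token rewards that sum to sequence_reward.
    """
    log_probs = F.log_softmax(logits, dim=-1)        # (T, V) log-probabilities
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)       # (T,) per-token policy entropy
    weights = entropy / (entropy.sum() + eps)        # normalize entropies to weights
    return sequence_reward * weights                 # entropy-weighted token rewards
```

Under this sketch, a token emitted with a near-deterministic distribution contributes little entropy and receives little of the reward, while tokens at uncertain, high-entropy decision points absorb most of the credit, which is the intuition behind using entropy as a proxy for effort at critical reasoning steps.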