This paper explores the use of reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) to improve the reasoning performance of large language models (LLMs). Existing GRPO is limited in long-chain reasoning tasks by its coarse credit assignment, which applies the same reward to every token in a sequence. To address this, we propose Dynamic Entropy Weighting (DEW). Building on the idea that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling, we generate finer-grained reward signals through two methods. First, Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token, enabling fine-grained credit assignment. Second, Sequence-Level Group Relative Policy Optimization (GRPO-S) assigns an entropy-weighted reward to each sequence based on its mean token entropy. Experiments show that both methods significantly outperform the strong DAPO baseline, confirming that the entropy-weighting mechanism is the primary driver of the performance gains and pointing to a better path toward deeper model reasoning.
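To make the entropy-weighting idea concrete, the following minimal PyTorch-style sketch shows one way such rewards could be computed. The function names, the specific weighting and normalization scheme, and the hyperparameter `alpha` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy H_t = -sum_v p(v) log p(v), for logits of shape [T, V]."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)  # shape [T]

def gtpo_token_rewards(seq_reward: float, entropies: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Token-level (GTPO-style) rewards: scale the sequence reward per token in
    proportion to that token's entropy relative to the sequence mean.
    (Illustrative weighting; the exact scheme is an assumption.)"""
    weights = 1.0 + alpha * (entropies / (entropies.mean() + 1e-8) - 1.0)
    return seq_reward * weights  # shape [T]

def grpo_s_sequence_reward(seq_reward: float, entropies: torch.Tensor,
                           alpha: float = 1.0) -> float:
    """Sequence-level (GRPO-S-style) reward: scale by the mean token entropy."""
    return seq_reward * (1.0 + alpha * entropies.mean().item())

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style group normalization: advantage = (r - group mean) / group std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy rollout: 12 generated tokens over a vocabulary of 50.
    logits = torch.randn(12, 50)
    H = token_entropy(logits)
    print(gtpo_token_rewards(1.0, H))  # per-token rewards for a correct rollout (R = 1)

    # Toy group of 4 rollouts: (correctness reward R_i, per-token entropies H_i).
    rollouts = [(1.0, torch.rand(10) * 2), (0.0, torch.rand(8) * 2),
                (1.0, torch.rand(15) * 2), (0.0, torch.rand(9) * 2)]
    seq_rewards = torch.tensor([grpo_s_sequence_reward(R, H_i) for R, H_i in rollouts])
    print(group_relative_advantages(seq_rewards))
```

In this sketch, GTPO spreads a rollout's reward unevenly across its tokens so that high-entropy tokens in correct answers receive more credit, while GRPO-S keeps a single reward per sequence but scales it by the sequence's mean token entropy before the usual group-relative normalization.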