This paper explores the use of reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) to improve the reasoning performance of large language models (LLMs). Existing GRPO-style methods suffer from coarse-grained credit assignment in long-chain reasoning tasks, because they apply the same reward to every token in a sequence. To address this, we propose a dynamic entropy weighting technique. Building on the core idea that high-entropy tokens in correct responses are associated with stronger performance, we generate finer-grained reward signals through two methods. First, **Group Token Policy Optimization (GTPO)** assigns an entropy-weighted reward to each token, enabling fine-grained credit assignment. Second, **Sequence-Level Group Relative Policy Optimization (GRPO-S)** assigns an entropy-weighted reward to each sequence based on that sequence's average token entropy. Experimental results show that the proposed methods significantly outperform the strong DAPO baseline, confirming that the entropy weighting mechanism is the primary driver of the performance gains.
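
To make the reward-shaping idea concrete, the following is a minimal Python sketch of entropy-weighted rewards at the token level (GTPO-style) and at the sequence level (GRPO-S-style). The normalization scheme and the `alpha`/`beta` coefficients are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy from logits of shape (seq_len, vocab_size)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)  # shape: (seq_len,)


def gtpo_token_rewards(seq_reward: float, entropies: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Token-level credit: modulate the sequence reward by each token's
    entropy relative to the sequence mean, so high-entropy tokens in a
    correct answer receive more credit (alpha is a hypothetical knob)."""
    norm_ent = entropies / (entropies.mean() + 1e-8)
    return seq_reward * (1.0 + alpha * (norm_ent - 1.0))


def grpo_s_sequence_reward(seq_reward: float, entropies: torch.Tensor,
                           beta: float = 0.5) -> float:
    """Sequence-level credit: weight the whole sequence's reward by its
    average token entropy (beta is a hypothetical knob)."""
    return seq_reward * (1.0 + beta * entropies.mean().item())


# Toy usage: one "correct" 4-token rollout with dummy logits.
logits = torch.randn(4, 32000)
ent = token_entropy(logits)
print(gtpo_token_rewards(1.0, ent))      # per-token, fine-grained rewards
print(grpo_s_sequence_reward(1.0, ent))  # single entropy-weighted sequence reward
```

In both variants the entropy-weighted rewards would then replace the uniform per-sequence reward inside a GRPO-style group-relative advantage computation.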