This paper addresses the policy entropy decay problem that arises during Reinforcement Learning with Verifiable Rewards (RLVR) training, which is used to improve the reasoning performance of large language models (LLMs). Policy entropy decay occurs when the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. This study proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation through difficulty-aware coefficient allocation, a target entropy anchored to the initial policy entropy, and dynamic global coefficient adjustment. Experimental results on several mathematical reasoning benchmarks demonstrate that AER outperforms existing methods, improving both reasoning accuracy and exploration.
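To make the third component concrete, the sketch below illustrates one way a global entropy coefficient could be adjusted toward a target anchored at the initial policy entropy. This is a minimal illustration under assumed design choices; the class name, the multiplicative update rule, and all hyperparameters (`target_ratio`, `step_size`) are hypothetical and are not taken from the paper.

```python
import math


class GlobalEntropyCoefficient:
    """Illustrative controller for a global entropy-regularization coefficient.

    Assumption: the target entropy is a fixed fraction of the entropy measured
    at the start of training, and the coefficient is nudged up when current
    entropy falls below that target (encouraging exploration) and down when it
    exceeds it (favoring exploitation). This is a sketch, not AER's exact rule.
    """

    def __init__(self, init_coef: float = 1e-3, target_ratio: float = 0.8,
                 step_size: float = 0.05):
        self.coef = init_coef            # current global coefficient
        self.target_ratio = target_ratio  # fraction of initial entropy to maintain (assumed)
        self.step_size = step_size        # adjustment rate (assumed)
        self.target_entropy = None        # anchored on the first call

    def update(self, current_entropy: float) -> float:
        # Anchor the target to the policy entropy observed at initialization.
        if self.target_entropy is None:
            self.target_entropy = self.target_ratio * current_entropy
        # Multiplicative update: grow the coefficient when entropy is below
        # target, shrink it when entropy is above target.
        gap = self.target_entropy - current_entropy
        self.coef *= math.exp(self.step_size * gap)
        return self.coef


# Example usage: the coefficient rises as measured policy entropy decays.
controller = GlobalEntropyCoefficient()
for entropy in [1.2, 1.0, 0.8, 0.6]:
    coef = controller.update(entropy)
    print(f"entropy={entropy:.2f} -> coef={coef:.5f}")
```

In this sketch the returned coefficient would scale an entropy bonus added to the policy-gradient objective; the paper's difficulty-aware allocation would further modulate that bonus per prompt, which is not shown here.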