This page curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention
Created by
Haebom
Author
Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, Jing Yang
Outline
This paper studies enhancing the reasoning capability of large language models (LLMs) via reinforcement learning with verifiable rewards (RLVR). To address limitations of existing methods, such as overly deterministic initial-state sampling and the entropy-collapse problem, the authors propose CURE (Critical-token-gUided Re-concatenation for Entropy-collapse Prevention), a framework that balances exploration and exploitation in two stages. In the first stage, the model is steered into new contexts by regenerating at high-entropy critical tokens, and the original and branched paths are optimized jointly; in the second stage, exploitation is strengthened by using static initial-state sampling as in the existing DAPO method. Experiments with the Qwen-2.5-Math-7B model show that CURE achieves a 5% performance improvement over existing RLVR methods across six mathematical benchmarks, reaching state-of-the-art results in both entropy and accuracy.
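The first stage hinges on identifying high-entropy "critical" tokens in a rollout and branching a fresh continuation from those positions. The sketch below illustrates that idea under simple assumptions: per-token next-token probability distributions are available, entropy is ordinary Shannon entropy, and the function names (`select_critical_tokens`, `reconcatenate`) are illustrative, not the paper's actual API.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy of each per-token next-token distribution.

    probs: array of shape (T, V) — T generated tokens, vocabulary size V.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def select_critical_tokens(probs, k=2):
    """Indices of the k highest-entropy positions (the 'critical' tokens)."""
    entropies = token_entropies(probs)
    return np.argsort(entropies)[::-1][:k]

def reconcatenate(tokens, critical_idx):
    """Branch point: keep the prefix before the critical token as a
    fresh rollout context, from which a new continuation is sampled."""
    return tokens[:critical_idx]

# Toy example: position 1 has a near-uniform (high-entropy) distribution.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.80, 0.10, 0.05, 0.05],
])
critical = select_critical_tokens(probs, k=1)
branch_prefix = reconcatenate([5, 6, 7, 8], int(critical[0]))
```

In the full method, both the original trajectory and the branched continuations would then be optimized jointly; this sketch only shows the selection and branching step.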
Takeaways, Limitations
•
Takeaways:
◦
Presents a new framework, CURE, that effectively addresses the entropy-collapse problem of existing RLVR methods.
◦
Presents an effective strategy for improving the mathematical reasoning performance of LLMs by balancing exploration and exploitation.
◦
Achieves state-of-the-art performance on six mathematical benchmarks with the Qwen-2.5-Math-7B model.
◦
Supports reproducibility and further research by releasing the code as open source.
•
Limitations:
◦
CURE's performance gains may be limited to a specific model (Qwen-2.5-Math-7B) and to mathematical reasoning tasks.
◦
Generalization to other types of tasks and models remains to be verified.
◦
The criteria for selecting high-entropy critical tokens in the first stage need clarification and optimization.
◦
Further research is needed on hyperparameter tuning to balance the two stages.