This page curates AI-related papers published worldwide. All summaries are generated with Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
Created by
Haebom
Author
Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu
Outline
This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel framework that addresses the limitations of Reinforcement Learning with Verifiable Rewards (RLVR) as a method for improving the complex reasoning ability of large language models (LLMs). Existing on-policy reinforcement learning methods struggle when the difficulty of the training data exceeds the model's current capability: reward signals become sparse and learning stalls. GHPO addresses this by dynamically adjusting task difficulty through adaptive prompt refinement, applying direct imitation learning to problems beyond the model's current capability and exploration-based reinforcement learning to problems the model can manage, which yields a more efficient learning process. Experiments on six mathematical benchmarks show that GHPO outperforms existing methods by an average of 5%, improving both training stability and final reasoning performance.
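To make the mechanism concrete, below is a minimal Python sketch of the difficulty-aware switching described above. The helpers (`generate`, `verify`, `rl_update`, `imitation_update`), the hint schedule, and the thresholds are assumptions for illustration, not the authors' implementation; it only shows one way the adaptive prompt refinement and the RL/imitation switch could be wired together.

```python
from typing import Callable, List, Sequence

def ghpo_step(
    prompt: str,
    reference_solution: str,
    generate: Callable[[str], str],          # sample one model completion (assumed helper)
    verify: Callable[[str], float],          # verifiable reward, e.g. 1.0 if the final answer is correct
    rl_update: Callable[[str, List[str], List[float]], None],  # on-policy RL update (assumed helper)
    imitation_update: Callable[[str, str], None],              # supervised/imitation update (assumed helper)
    n_rollouts: int = 8,
    hint_fractions: Sequence[float] = (0.25, 0.5, 0.75),       # illustrative hint schedule
) -> None:
    """One GHPO-style step: try exploration-based RL first; if every rollout
    fails (sparse reward), refine the prompt with a growing hint taken from
    the reference solution; if the problem is still unsolved, fall back to
    direct imitation of the reference."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    rewards = [verify(r) for r in rollouts]
    if any(r > 0 for r in rewards):
        # Problem is within the model's current capability: standard RL update.
        rl_update(prompt, rollouts, rewards)
        return

    # Sparse-reward case: adaptively refine the prompt by revealing an
    # increasing fraction of the reference solution until the model succeeds.
    for frac in hint_fractions:
        hint = reference_solution[: int(len(reference_solution) * frac)]
        guided_prompt = f"{prompt}\nPartial solution:\n{hint}"
        rollouts = [generate(guided_prompt) for _ in range(n_rollouts)]
        rewards = [verify(r) for r in rollouts]
        if any(r > 0 for r in rewards):
            rl_update(guided_prompt, rollouts, rewards)  # RL on the easier, guided prompt
            return

    # Still unsolved even with hints: learn directly from the reference solution.
    imitation_update(prompt, reference_solution)
```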
Takeaways and Limitations
• Takeaways:
◦ We demonstrate that dynamically adjusting task difficulty through adaptive prompt refinement can significantly improve the training stability and efficiency of reinforcement learning.
◦ GHPO overcomes limitations of on-policy reinforcement learning and curriculum learning, offering a method that can be applied effectively even to smaller LLMs.
◦ We experimentally demonstrate that it is an effective way to improve LLM performance on tasks that require complex reasoning.
• Limitations:
◦ Generalization to domains or task types beyond the six mathematics benchmarks requires further study.
◦ Further research is needed to optimize the adaptive prompt refinement strategy and to assess how the complexity of the prompt-refinement process affects overall system efficiency.
◦ Further analysis is needed to determine how much of GHPO's performance gain depends on specific benchmarks or hyperparameter settings.