This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
This paper addresses the two major sources of training data for post-training modern language models: online data (rollouts generated by the model itself) and offline data (demonstrations from humans or other models). Approaches such as reinforcement learning (RL) and supervised fine-tuning (SFT) each rely primarily on one of these data types (online rollouts for RL, offline demonstrations for SFT). The paper demonstrates that these approaches are not contradictory but are instances of a single optimization process. It derives a Unified Policy Gradient Estimator and shows that the updates of a broad range of post-training approaches can be computed as the gradient of a common objective under different data-distribution assumptions and bias-variance trade-offs. The gradient estimator consists of four interchangeable parts: a stabilization mask, a reference-policy denominator, an advantage estimate, and a likelihood gradient. Building on these theoretical findings, the paper proposes Hybrid Post-Training (HPT), an algorithm that dynamically selects training signals. HPT is designed to provide both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. Extensive experiments and ablation studies validate the unified theoretical framework and the effectiveness of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently outperforms strong baselines across models of different sizes and families.
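To make the four-part structure concrete, below is a minimal PyTorch-style sketch (not the authors' code) of a per-token surrogate loss whose gradient has the described form: stabilization mask × advantage estimate × likelihood gradient, scaled by a reference-policy denominator. The function name `unified_pg_loss`, its arguments, and the RL/SFT instantiations noted in the comments are illustrative assumptions rather than the paper's API; HPT can be loosely thought of as switching between such instantiations per prompt based on how well the model's own rollouts perform.

```python
# Hedged sketch (assumed, not the authors' implementation) of a unified
# per-token surrogate loss with the four interchangeable parts described above.
import torch

def unified_pg_loss(logp_theta: torch.Tensor,  # log pi_theta per token (requires grad)
                    logp_ref: torch.Tensor,    # log pi_ref per token (reference policy)
                    advantage: torch.Tensor,   # advantage estimate per token (or broadcastable scalar)
                    stable_mask: torch.Tensor  # 0/1 stabilization mask per token
                    ) -> torch.Tensor:
    # Importance ratio pi_theta / pi_ref; the reference policy plays the role of the denominator.
    ratio = torch.exp(logp_theta - logp_ref.detach())
    # The gradient of `ratio` is grad(pi_theta) / pi_ref, so the loss gradient is
    # -(mask * advantage * grad(pi_theta) / pi_ref), matching the unified estimator form.
    return -(stable_mask * advantage * ratio).mean()

# Illustrative instantiations (assumptions, not taken from the paper):
# - RL-style update on rollouts: pi_ref = behavior/old policy, advantage computed from rewards.
# - SFT-style update on demonstrations: mask = 1, advantage = 1, pi_ref = stop-grad(pi_theta),
#   which reduces the gradient to the plain negative log-likelihood gradient.

if __name__ == "__main__":
    T = 8
    logp_theta = torch.randn(T, requires_grad=True)  # stand-in for token log-probs under pi_theta
    logp_ref = torch.randn(T)                        # stand-in for token log-probs under pi_ref
    adv = torch.randn(T)
    mask = torch.ones(T)
    loss = unified_pg_loss(logp_theta, logp_ref, adv, mask)
    loss.backward()
    print(f"loss={loss.item():.4f}, grad-norm={logp_theta.grad.norm().item():.4f}")
```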
Takeaways, Limitations
• Takeaways:
◦ The paper deepens theoretical understanding by presenting a single optimization framework that unifies post-training approaches (e.g., RL and SFT).
◦ It proposes Hybrid Post-Training (HPT), an effective algorithm that simultaneously achieves utilization of demonstrations and stable exploration.
◦ The superior performance of HPT is verified experimentally on a variety of benchmarks.
◦ Performance improvements are consistent regardless of model size and family.
• Limitations:
◦ Further research may be needed to determine the optimal hyperparameters of the proposed HPT algorithm.
◦ Generalization to other types of language models and tasks requires further validation.
◦ A more detailed analysis of HPT's computational cost and efficiency may be needed.