Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Created by
  • Haebom

Authors

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou

Outline

This paper presents CHORD, a framework that integrates supervised fine-tuning (SFT) and reinforcement learning (RL), the two major post-training methods for improving the capabilities and aligning the behavior of large language models (LLMs). Existing approaches that combine SFT and RL risk disrupting established model patterns and overfitting to expert data. CHORD addresses this by reframing SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of how off-policy expert data influences training at both global and granular levels, CHORD employs a dual control mechanism: a global coefficient that guides the transition from off-policy imitation to on-policy exploration, and a token-wise weighting function that enables fine-grained learning from expert tokens, preserving on-policy exploration while mitigating the disruption caused by off-policy data. Extensive experiments show that CHORD achieves a stable and efficient learning process and yields significant performance improvements over baselines.
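The summary does not give the exact formulas, so the sketch below is only an illustration of the general idea in PyTorch: a global coefficient `mu` that decays over training blends the on-policy RL loss with an SFT term on expert tokens, while a hypothetical token-wise weight p·(1−p) modulates how strongly each expert token contributes. The linear schedule, the weighting function, and all names (`mu_schedule`, `chord_style_loss`) are assumptions for illustration, not the paper's definitions.

```python
import torch

def mu_schedule(step: int, total_steps: int,
                mu_start: float = 0.9, mu_end: float = 0.05) -> float:
    """Global coefficient mu: starts imitation-heavy and decays toward
    exploration-heavy. A linear decay is assumed here; the paper's
    actual schedule may differ."""
    frac = min(step / max(total_steps, 1), 1.0)
    return mu_start + frac * (mu_end - mu_start)

def chord_style_loss(rl_loss: torch.Tensor,
                     expert_logprobs: torch.Tensor,
                     mu: float) -> torch.Tensor:
    """Blend an on-policy RL loss with a token-weighted SFT term.

    expert_logprobs: log pi_theta(expert token) at each position of an
    expert demonstration, shape (batch, seq_len), produced by the
    policy's own forward pass.
    """
    p = expert_logprobs.exp()
    # Hypothetical token-wise weight p * (1 - p): emphasizes tokens the
    # policy is uncertain about and down-weights tokens it already
    # predicts confidently. Detached so it scales the gradient rather
    # than creating an extra gradient path.
    token_w = (p * (1.0 - p)).detach()
    sft_loss = -(token_w * expert_logprobs).mean()
    return (1.0 - mu) * rl_loss + mu * sft_loss

# Smoke test with dummy values standing in for real model outputs.
rl_loss = torch.tensor(0.7, requires_grad=True)  # e.g., a GRPO/PPO loss
expert_logprobs = torch.empty(2, 8).uniform_(-3.0, -0.1).requires_grad_(True)
loss = chord_style_loss(rl_loss, expert_logprobs,
                        mu=mu_schedule(step=100, total_steps=1000))
loss.backward()
```

Early in training mu is large, so gradients mostly imitate the expert data; as mu decays, the on-policy RL signal dominates, matching the imitation-to-exploration transition described above.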

Takeaways, Limitations

Takeaways:
Presents CHORD, a new framework that effectively integrates SFT and RL.
Achieves stable and efficient learning through a dual control mechanism that regulates the influence of off-policy data at both the global and the granular (token) level.
Experimentally demonstrates performance improvements over existing methods.
Encourages further research through an open-source code release.
Limitations:
Further research is needed on the generalization performance of the proposed framework.
Additional experiments across diverse LLM architectures and datasets are needed.
Further analysis of computational cost and complexity is needed.