Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Created by
  • Haebom

Author

Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu

Outline

This paper addresses the post-training phase, a key step in adapting pre-trained large language models (LLMs) to real-world tasks, in which models learn either from demonstrations or from preference signals. It presents a theoretical framework that unifies Supervised Fine-Tuning (SFT) and preference learning methods such as Direct Preference Optimization (DPO), showing through rigorous mathematical derivation that SFT and DPO operate in the same optimal policy–reward subspace and that SFT is a special case of implicit reward learning. The authors point out an important limitation of conventional SFT: the KL divergence term in its distribution-matching objective becomes constant with respect to the policy during optimization and therefore fails to constrain model updates. To address this, they propose a learning rate decay technique, which yields performance gains of up to 25% relative improvement and a 6% absolute win-rate increase. They further derive alternative SFT objective functions from various f-divergences that keep the KL term active during optimization and further improve model performance after subsequent DPO, and they extend the theoretical relationship between LLM logits and the Q-function from preference learning to the SFT context, providing both mathematical derivation and experimental verification.
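For context, the sketch below recalls the standard relations that this kind of unified view builds on; the notation (β, π_ref, Z(x), p_data) is conventional and assumed here rather than quoted from the paper. It shows the KL-regularized objective, its closed-form optimal policy, the implicit reward that policy induces, and the forward-KL (distribution-matching) reading of SFT, whose entropy term is constant in the policy parameters and thus reduces SFT to the usual cross-entropy loss.

```latex
% Standard background relations (conventional notation, not quoted from the paper).
% KL-regularized objective over policies \pi, with reference policy \pi_ref and strength \beta:
\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\!\big[\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]

% Closed-form optimum and the implicit reward it induces (Z(x) is the partition function):
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big),
\qquad
r(x,y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% Forward-KL (distribution-matching) view of SFT: the data entropy H(p_data) does not
% depend on \theta, so minimizing this divergence is equivalent to the cross-entropy loss.
\mathrm{D}_{\mathrm{KL}}\!\big[p_{\mathrm{data}}(\cdot \mid x)\,\big\|\,\pi_{\theta}(\cdot \mid x)\big]
  \;=\; -\,\mathbb{E}_{y \sim p_{\mathrm{data}}(\cdot \mid x)}\big[\log \pi_{\theta}(y \mid x)\big]
  \;-\; H\!\big(p_{\mathrm{data}}(\cdot \mid x)\big)
```

DPO's loss is the Bradley–Terry log-likelihood applied to this implicit reward, which is one way to see why both methods can be analyzed within the same policy–reward subspace.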

Takeaways, Limitations

Takeaways:
A unified theoretical framework for SFT and preference learning methods such as DPO
Identification of a key limitation of conventional SFT (the KL divergence term becomes constant with respect to the policy) and a remedy via learning rate decay (a minimal sketch of such a schedule appears at the end of this post)
Performance improvement through alternative SFT objective functions derived from f-divergences
Extension and validation of the relationship between LLM logits and the Q-function in the SFT context
Significant performance improvements on instruction-following tasks (up to 25% relative improvement and a 6% absolute win-rate increase)
Limitations:
Further studies are needed to determine the generality of the proposed method and its applicability to other types of tasks.
Further research is needed to determine the optimal schedule and settings for the learning rate decay technique.
A clear discussion is needed on the limitations and scope of applicability of the proposed theoretical framework.
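Since the learning-rate-decay remedy is described above only at a high level, the following is a minimal, hypothetical sketch of what such an SFT schedule could look like. The cosine schedule, the hyperparameter values, and the Hugging Face-style model interface (a causal LM that returns `.loss` when given `labels`) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: a standard SFT loop with a decaying learning rate schedule.
# The specific schedule (cosine annealing) and values are illustrative choices only.
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def sft_with_lr_decay(model, dataloader, num_steps, peak_lr=2e-5, final_lr=2e-6):
    optimizer = AdamW(model.parameters(), lr=peak_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=final_lr)
    step = 0
    for batch in dataloader:
        # Standard SFT objective: cross-entropy (negative log-likelihood) on demonstrations.
        # Assumes a Hugging Face-style causal LM that computes the loss from `labels`.
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()          # decay the learning rate after every update
        optimizer.zero_grad()
        step += 1
        if step >= num_steps:
            break
    return model
```

Intuitively, shrinking the step size over training bounds how far later updates can push the policy, which is the constraining role the summary above attributes to this technique in the absence of an effective KL term.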