Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Created by
  • Haebom

Authors

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

Outline

In this paper, we analyze the strengths and weaknesses of supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), two post-training techniques for large language models (LLMs), and propose Prefix-RFT, a new method that integrates them. SFT excels at imitation but generalizes poorly, while RFT is effective at improving performance but struggles to learn behaviors not already present in the model and is sensitive to the initial policy. Prefix-RFT combines the strengths of both: rollouts are conditioned on sampled prefixes of demonstration data, so learning from demonstrations and exploratory learning proceed simultaneously. Experiments on mathematical reasoning problems show that it outperforms SFT, RFT, and a parallel mixed-policy RFT baseline. It also integrates easily into existing open-source frameworks, and its robustness to the quality and quantity of the demonstration data is confirmed.
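The prefix-sampling mechanism named in the title can be sketched in a few lines: each rollout is anchored on a partial expert demonstration, the policy explores the continuation, and an RFT-style update is applied to the generated part. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation; sample_prefix, policy_generate, policy_update, reward_fn, and the uniform prefix-length sampling are hypothetical placeholders.

    import random

    def sample_prefix(demo_tokens, max_frac=1.0):
        # Keep a random-length prefix of the expert demonstration;
        # the uniform sampling schedule here is an assumption.
        cut = random.randint(0, int(len(demo_tokens) * max_frac))
        return demo_tokens[:cut]

    def prefix_rft_step(policy_generate, policy_update, reward_fn, prompt, demo_tokens):
        # One illustrative Prefix-RFT rollout:
        # 1) anchor the trajectory with a sampled demonstration prefix,
        # 2) let the current policy explore the continuation,
        # 3) score the stitched completion and apply an RFT-style update
        #    to the generated suffix only.
        prefix = sample_prefix(demo_tokens)
        continuation = policy_generate(prompt, prefix)
        completion = prefix + continuation
        reward = reward_fn(prompt, completion)
        policy_update(prompt, prefix, continuation, reward)
        return reward

    if __name__ == "__main__":
        # Toy stand-ins: a real setup would plug in an LLM, a verifier-based
        # reward (e.g. final-answer checking), and a PPO/GRPO-style update.
        demo = list("step-by-step expert solution")
        r = prefix_rft_step(
            policy_generate=lambda prompt, prefix: list(" model continuation"),
            policy_update=lambda *args: None,
            reward_fn=lambda prompt, completion: 1.0,
            prompt="Solve: 2 + 2 = ?",
            demo_tokens=demo,
        )
        print("reward:", r)

Varying how much of the demonstration is kept lets the method interpolate between SFT-like behavior (long prefixes) and pure RFT exploration (empty prefixes), which is the complementarity the paper exploits.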

Takeaways, Limitations

Takeaways:
  • Prefix-RFT, which integrates the strengths of SFT and RFT, outperforms existing methods.
  • Prefix-RFT can be easily applied to existing open-source frameworks.
  • The method's robustness to the quality and quantity of the demonstration data is verified.
  • The complementary nature of SFT and RFT is highlighted, suggesting that an integrated paradigm is a promising direction for future research.
Limitations:
  • Performance is evaluated only in a single domain, mathematical reasoning; generalizability to other domains requires further study.
  • Detailed discussion of Prefix-RFT's parameter settings and optimization is limited.