Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Created by
  • Haebom

Authors

Chongli Qin, Jost Tobias Springenberg

Outline

This paper reinterprets behavior cloning (BC), a traditional supervised learning method, from a reinforcement learning (RL) perspective, showing that it maximizes a lower bound on the RL objective in a sparse-reward setting. The authors argue that standard supervised fine-tuning (SFT) on curated data can likewise be understood as maximizing this lower bound, and propose importance-weighted supervised fine-tuning (iw-SFT), a small modification of SFT that yields a closer approximation of the RL objective. iw-SFT can outperform standard SFT and extends naturally to data annotated with quality scores. Experiments show that iw-SFT is competitive with more advanced RL algorithms on large language models and continuous control tasks, reaching 66.7% on AIME 2024.
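To make the difference with plain SFT concrete, here is a minimal sketch (not taken from the paper) of what an importance-weighted SFT loss could look like in PyTorch. The function name `iw_sft_loss`, the clipping constant, and the way quality scores are folded in are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def iw_sft_loss(policy_logprobs, ref_logprobs, quality_scores=None, clip_max=10.0):
    """Illustrative importance-weighted SFT loss (sketch, not the paper's exact method).

    policy_logprobs: (batch,) sum of per-token log-probs of each curated
        sequence under the current policy pi_theta.
    ref_logprobs:    (batch,) the same quantity under the reference policy
        that generated or filtered the curated data.
    quality_scores:  optional (batch,) per-sequence quality weights.
    """
    # Importance ratio pi_theta(x) / pi_ref(x), detached so it rescales the
    # gradient of the log-likelihood rather than being differentiated itself.
    log_ratio = (policy_logprobs - ref_logprobs).detach()
    weights = torch.exp(log_ratio).clamp(max=clip_max)  # clip for stability

    if quality_scores is not None:
        weights = weights * quality_scores  # optionally fold in data quality

    # Plain SFT would be -policy_logprobs.mean(); iw-SFT reweights each
    # sequence's negative log-likelihood before averaging.
    return -(weights * policy_logprobs).mean()
```

When all importance weights equal 1 (the current policy matches the reference that produced the data), this reduces to the ordinary SFT negative log-likelihood, which matches the paper's framing of SFT as a special case.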

Takeaways, Limitations

Takeaways:
SFT is reinterpreted from an RL perspective, strengthening its theoretical foundation.
iw-SFT is proposed as an improvement over standard SFT.
The approach generalizes SFT to data annotated with quality scores.
Competitive results are achieved on large language models and continuous control tasks.
Limitations:
The performance gains from iw-SFT may not hold uniformly; the degree of improvement can depend on the characteristics of the data.
Further experiments across a wider range of environments and tasks are needed to establish how broadly the method generalizes.
Few results are reported beyond the AIME 2024 dataset, which leaves open questions about generalization performance.