Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Created by
  • Haebom

Author

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

Outline

This paper studies how to best integrate supervised fine-tuning (SFT) and reinforcement learning (RL) to improve the reasoning ability of large language models (LLMs). From an entropy-based perspective, the authors comprehensively analyze token distributions, learning dynamics, and integration mechanisms, finding that SFT induces coarse-grained global changes in the LLM policy distribution while RL performs fine-grained selective optimization, and that entropy serves as an important indicator of training effectiveness. Based on these observations, the paper proposes Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies the two fine-tuning paradigms through an entropy-aware weighting mechanism. Instead of a two-stage sequential pipeline, SRFT applies SFT and RL simultaneously, directly optimizing the LLM with demonstrations and self-exploration rollouts. Extensive experiments show that SRFT achieves an average accuracy of 59.1%, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks.
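
The summary describes SRFT only at a high level. As a rough illustration of what a single-stage, entropy-aware combination of an SFT loss and an RL loss could look like, here is a minimal PyTorch sketch. The function names, the REINFORCE-style RL term, and the sigmoid weighting schedule are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of an SRFT-style combined objective (PyTorch).
# The entropy-based weights below are illustrative assumptions,
# not the authors' actual weighting functions.
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Mean per-token entropy of the policy distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1).mean()

def srft_style_loss(demo_logits, demo_targets, rollout_logprobs, rollout_advantages):
    """Combine an SFT loss on demonstrations with a policy-gradient loss on
    self-exploration rollouts, weighted by the current policy entropy.

    demo_logits:        (B, T, V) logits on demonstration tokens
    demo_targets:       (B, T)    demonstration token ids
    rollout_logprobs:   (N,)      log-probs of sampled rollout sequences
    rollout_advantages: (N,)      advantages (e.g., reward minus baseline)
    """
    # Supervised term: cross-entropy on expert demonstrations.
    sft_loss = F.cross_entropy(demo_logits.flatten(0, 1), demo_targets.flatten())

    # RL term: REINFORCE-style policy gradient on self-exploration rollouts.
    rl_loss = -(rollout_advantages.detach() * rollout_logprobs).mean()

    # Entropy-aware weights (hypothetical schedule): when policy entropy is
    # high, lean more on the stabilizing SFT signal; when low, lean on RL.
    ent = token_entropy(demo_logits).detach()
    w_sft = torch.sigmoid(ent - 1.0)
    w_rl = 1.0 - w_sft

    return w_sft * sft_loss + w_rl * rl_loss
```

Both losses are computed in the same update step, which is the distinguishing feature of a single-stage scheme compared with running SFT to convergence and only then starting RL.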

Takeaways, Limitations

Takeaways:
By characterizing the differences between SFT and RL from an entropy perspective, the paper presents a new methodology that combines the strengths of both.
The single-stage SRFT method opens the possibility of more efficient LLM fine-tuning than the conventional two-stage sequential approach.
SRFT demonstrates superior performance compared to existing methods across a range of benchmarks.
Limitations:
Further verification of the generalization performance of the proposed SRFT method is needed.
Further research is needed on its applicability to other types of LLMs and reasoning tasks.
The optimization process of the entropy-aware weighting mechanism is not described in detail.