Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Created by
Haebom
Author
Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu
Outline
This paper compares and analyzes the impact of two core learning paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), on the Continual Post-Training (CPT) of multimodal large language models. Experiments with the Qwen2.5-VL-7B-Instruct model on seven diverse multimodal task benchmarks show that SFT rapidly forgets previously learned tasks, whereas RFT preserves prior knowledge and even improves general knowledge. The stability of RFT stems from an implicit, data-dependent regularization mechanism shaped by the reward distribution, rather than from explicit mechanisms such as a KL penalty or chain-of-thought reasoning. The authors further propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT.
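The summary does not spell out the filtering criterion, so the sketch below is only an illustration: it assumes instances are kept when their sampled rollouts disagree in reward (a variance-based filter), and `policy.generate` and `reward_fn` are hypothetical stand-ins rather than APIs from the paper.

```python
# Hypothetical sketch of rollout-based instance filtering before an RFT update.
# Assumption: `policy.generate(prompt)` samples one response and
# `reward_fn(response, reference)` scores it; neither is defined in the summary.
from statistics import pstdev

def filter_instances(instances, policy, reward_fn, num_rollouts=8, min_std=1e-6):
    """Keep instances whose rollout rewards show variance.

    If every rollout for a prompt receives the same reward (all solved or
    all failed), the group-relative advantage is zero and the instance adds
    little learning signal, so it is skipped.
    """
    kept = []
    for prompt, reference in instances:
        rewards = [reward_fn(policy.generate(prompt), reference)
                   for _ in range(num_rollouts)]
        if pstdev(rewards) > min_std:  # rollouts disagree -> informative instance
            kept.append((prompt, reference, rewards))
    return kept
```

Skipping zero-variance instances also connects to the regularization intuition above: prompts the model already answers consistently produce essentially no policy-gradient update, so compute is spent only where the reward signal can actually change the policy.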
Takeaways, Limitations
•
Takeaways:
◦
In continual post-training, RFT is a more robust and stable learning paradigm than SFT.
◦
RFT retains prior knowledge and improves general model capabilities.
◦
The stability of RFT arises from an implicit, data-dependent regularization mechanism shaped by the reward distribution.
◦
The stability and efficiency of RFT can be further improved with a rollout-based instance filtering algorithm.
•
Limitations:
◦
The paper does not explicitly state its limitations.