Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Created by
  • Haebom

Author

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

Outline

This paper compares the impact of two core learning paradigms, Supervised Fine-tuning (SFT) and Reinforcement Fine-tuning (RFT), on the continual post-training (CPT) of multimodal large language models. Experiments with the Qwen2.5-VL-7B-Instruct model on seven diverse multimodal task benchmarks show that SFT rapidly forgets previously learned tasks, whereas RFT retains prior knowledge and even improves general knowledge. The stability of RFT stems from a data-dependent regularization mechanism naturally induced by the reward distribution, rather than from explicit mechanisms such as a KL penalty or chain-of-thought reasoning. The authors further propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT.
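The rollout-based instance filtering algorithm is only described at a high level here. The sketch below illustrates one plausible form of such a filter, under the assumption that it drops instances whose sampled rollouts all receive (near-)identical rewards and therefore contribute little signal to the policy update. The helpers `policy_sample` and `reward_fn` are hypothetical placeholders, not the authors' API.

```python
import numpy as np

def filter_instances_by_rollout(instances, policy_sample, reward_fn,
                                num_rollouts=8, min_reward_std=1e-6):
    """Keep only training instances whose rollouts yield a non-degenerate
    reward distribution, i.e., instances that still carry a learning signal.

    policy_sample(prompt, n) -> list of n sampled responses (hypothetical)
    reward_fn(prompt, response) -> scalar reward (hypothetical)
    """
    kept = []
    for inst in instances:
        responses = policy_sample(inst["prompt"], num_rollouts)
        rewards = np.array([reward_fn(inst["prompt"], r) for r in responses])
        # If all rollouts score (nearly) the same, e.g. all correct or all
        # wrong, the advantage is roughly zero and the instance adds rollout
        # cost without a useful gradient signal, so it is skipped.
        if rewards.std() > min_reward_std:
            kept.append(inst)
    return kept
```

A filter of this kind would avoid spending rollouts on instances the model already solves (or cannot solve at all), which matches the paper's stated goal of improving RFT's stability and efficiency.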

Takeaways, Limitations

Takeaways:
In continual post-training, RFT is a more robust and stable paradigm than SFT.
RFT retains prior knowledge and improves general model capabilities.
The stability of RFT is due to a data-dependent regularization mechanism naturally induced by the reward distribution.
The stability and efficiency of RFT can be further improved with a rollout-based instance filtering algorithm.
Limitations:
The paper does not explicitly state its limitations.