Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning

Created by
  • Haebom

Authors

Yutong Chen, Jiandong Gao, Ji Wu

Outline

Rule-based reinforcement learning (RL) substantially improves the reasoning performance of large language models (LLMs), yet its underlying mechanism remains unclear. This paper finds that small-scale supervised fine-tuning (SFT) has a significant influence on subsequent RL but is sample-inefficient, and proposes an analytical framework to explain why. The framework compares the efficiency of SFT and RL by measuring the sampling effect, and points to ways SFT's efficiency can be improved. Building on this analysis, the authors propose a "re-distillation" technique that samples from RL-trained policies to make small-scale distillation more effective. Across three datasets and the Qwen and Llama model families, re-distilled models match RL performance with far fewer samples and less computation. On the K&K dataset, a re-distilled Qwen2.5-1.5B model outperforms DeepSeek-V3-0324 with only 1K SFT samples. The paper further shows that re-distillation can efficiently balance multiple objectives in RL, and explains several interesting phenomena of R1-style RL, shedding light on the mechanisms behind its empirical success.
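
The re-distillation recipe itself is straightforward to sketch. Below is a minimal, hypothetical Python sketch of the idea using the Hugging Face transformers API: sample completions from the RL-trained policy on the training prompts, keep those a rule-based checker accepts, and use the resulting pairs as a small SFT set for the student model. The model path, `reward_fn`, and the correctness filter are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of re-distillation (hypothetical, not the authors' code).
# Assumed placeholders: RL_POLICY_PATH and reward_fn (a rule-based checker).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RL_POLICY_PATH = "path/to/rl-trained-policy"  # assumption: policy already trained with rule-based RL

tokenizer = AutoTokenizer.from_pretrained(RL_POLICY_PATH)
policy = AutoModelForCausalLM.from_pretrained(RL_POLICY_PATH, torch_dtype=torch.bfloat16)

def build_redistillation_set(prompts, reward_fn, n_samples=1000):
    """Sample completions from the RL-trained policy and keep those the
    rule-based reward accepts; the result is a small SFT dataset."""
    sft_pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = policy.generate(
            **inputs, max_new_tokens=512, do_sample=True, temperature=0.7
        )
        # Strip the prompt tokens, keep only the generated completion.
        completion = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        if reward_fn(prompt, completion):  # e.g. exact-match check on the final answer
            sft_pairs.append({"prompt": prompt, "completion": completion})
        if len(sft_pairs) >= n_samples:
            break
    return sft_pairs

# The resulting ~1K pairs would then be used for ordinary supervised
# fine-tuning of a small student model (e.g. Qwen2.5-1.5B), in place of
# a full RL run.
```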

Takeaways, Limitations

Takeaways:
Proposing a re-distillation technique that improves the efficiency of small-scale SFT.
Achieving RL-level performance with far fewer samples and less computation.
Deepening understanding of the mechanisms behind R1-style RL.
Demonstrating that re-distillation can efficiently balance multiple objectives in RL.
Limitations:
Further research is needed to determine the generalizability of the proposed analytical framework and redistillation technique.
Further experiments on various LLMs and datasets are needed.
Further analysis of the computational cost and practical limitations of the re-distillation technique is needed.