Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper remains with its authors and their institutions; when sharing, please cite the source.

Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

Created by
  • Haebom

Authors

Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani

Outline

This study questions whether supervised fine-tuning (SFT) scores, used during post-training to improve the reasoning capability of large language models (LLMs), actually guarantee better performance after reinforcement learning (RL). The authors present cases in which high SFT scores are followed by poor post-RL performance, suggesting that SFT scores can be distorted by simple or homogeneous data. They run SFT and RLVR (GRPO) training across a range of models and datasets, and show that generalization loss and Pass@large-k metrics are more reliable predictors of RL outcomes.
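For context, Pass@k is commonly estimated with the unbiased estimator of Chen et al. (2021); the minimal sketch below shows that computation. It is an assumption that this paper uses this exact estimator, and the choice of k is illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of Pass@k (Chen et al., 2021).

    n: number of samples generated for a problem
    c: number of those samples that are correct
    k: attempt budget being scored (e.g., 1, or a large value such as 64)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Averaging pass_at_k over all evaluation problems with a small k (Pass@1)
# versus a large k gives the "Pass@large k" signal referenced above.
```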

Takeaways, Limitations

Takeaways:
Demonstrates that high SFT scores do not always translate into better RL performance.
Presents generalization loss and Pass@large k as stronger predictors of RL outcomes (see the sketch after this list).
Analyzes how SFT training choices (e.g., number of epochs, example length) affect RL results.
Emphasizes the importance of adjusting the training strategy within a given SFT budget.
Open-source evaluation tools are to be released.
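A minimal sketch of the generalization-loss signal mentioned above: it computes mean held-out cross-entropy for an SFT checkpoint, assuming a Hugging Face-style causal LM and a hypothetical batch layout. The checkpoint-selection rule in the trailing comment is illustrative, not the authors' released tooling.

```python
import torch

@torch.no_grad()
def heldout_loss(model, heldout_batches, device="cuda"):
    """Mean next-token cross-entropy (generalization loss) on held-out data.

    Assumes a Hugging Face-style causal LM where passing `labels` makes the
    forward pass return token-averaged cross-entropy in `outputs.loss`.
    `heldout_batches` is a hypothetical iterable of dicts holding
    `input_ids` and `attention_mask` tensors.
    """
    model.eval()
    losses = []
    for batch in heldout_batches:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=input_ids)
        losses.append(outputs.loss.item())
    return sum(losses) / max(len(losses), 1)

# Illustrative selection rule: before starting RL, prefer the SFT checkpoint
# with the lowest held-out loss rather than the highest SFT benchmark score.
# best_ckpt = min(checkpoints, key=lambda c: heldout_loss(load(c), heldout_batches))
```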
Limitations:
Experiments cover models of up to 12B parameters; generalization to larger models requires further study.
Findings may be specific to the SFT/RL datasets and methods used.
Further validation is needed to determine whether the conclusions generalize across LLM architectures.