Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Created by
  • Haebom

Author

Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu

Outline

This paper studies why different families of base language models, such as Llama and Qwen, behave so differently during reinforcement learning (RL)-based post-training, especially on reasoning-intensive tasks. The authors investigate how mid-training strategies shape RL dynamics, focusing on the Qwen and Llama model families. They find that a high-quality math corpus (MegaMath-Web-Pro) improves both base-model and RL performance, while existing alternatives (e.g., FineMath-4plus) do not. Adding QA-style data, especially long chain-of-thought (CoT) reasoning examples, improves RL results, and additional instruction data further amplifies this effect. Long CoT deepens reasoning, but it can also make model responses verbose and destabilize RL training. Finally, scaling up mid-training improves downstream RL performance. Based on these insights, the authors propose a two-stage mid-training strategy, Stable-then-Decay, and use it to build the OctoThinker family of models, which show strong RL compatibility. They also release datasets such as MegaMath-Web-Pro-Max (over 70 billion tokens).
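
To make the two-stage idea concrete, below is a minimal sketch of what a "Stable-then-Decay" schedule could look like as a learning-rate schedule for mid-training: a constant (stable) phase followed by an annealing (decay) phase. The stage lengths, learning rates, and cosine decay shape are illustrative assumptions, not the values or exact recipe used in the paper.

```python
# Illustrative sketch of a two-stage "Stable-then-Decay" schedule for mid-training.
# All constants below are hypothetical placeholders, not the OctoThinker settings.
import math

def stable_then_decay_lr(step: int,
                         stable_steps: int = 100_000,
                         decay_steps: int = 20_000,
                         peak_lr: float = 3e-4,
                         final_lr: float = 3e-5) -> float:
    """Return the learning rate for a given optimizer step.

    Stage 1 (stable): hold the learning rate constant on the broad corpus.
    Stage 2 (decay):  anneal toward final_lr, e.g. while shifting the data mix
                      toward QA-style / long chain-of-thought examples.
    """
    if step < stable_steps:
        return peak_lr
    # Cosine anneal from peak_lr down to final_lr over the decay stage.
    progress = min((step - stable_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Example: inspect the schedule at a few points across both stages.
for s in (0, 50_000, 100_000, 110_000, 120_000):
    print(s, round(stable_then_decay_lr(s), 6))
```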

Takeaways, Limitations

Takeaways:
Highlights the importance of a high-quality math corpus (MegaMath-Web-Pro).
Demonstrates the effectiveness of QA-style data and long CoT reasoning examples.
Underscores the importance of data format (presenting both the advantages and drawbacks of long CoT).
Confirms the benefit of scaling up mid-training.
Introduces a new two-stage mid-training strategy (Stable-then-Decay) and the OctoThinker model family.
Releases a large-scale, math reasoning-intensive corpus (MegaMath-Web-Pro-Max).
Limitations:
The study covers only two model families, Qwen and Llama; further research is needed to determine whether the findings generalize to other model families.
Further research is needed on the optimal parameter settings for the Stable-then-Decay strategy.
More effective solutions are needed for the verbosity and training instability caused by long CoT.