This paper studies why different families of base language models, such as Llama and Qwen, exhibit divergent behaviors during reinforcement learning (RL)-based post-training, especially on reasoning-intensive tasks. We investigate how mid-training strategies shape RL dynamics, focusing on two representative model families, Qwen and Llama. We find that a high-quality math corpus (MegaMath-Web-Pro) improves both the base model and subsequent RL performance, whereas existing alternatives (e.g., FineMath-4plus) do not. We further find that adding QA-style data, especially long chain-of-thought (CoT) reasoning examples, improves RL outcomes, and that instruction data further amplifies this effect. We show that long CoT improves reasoning depth but can also make model responses verbose and RL training unstable. Finally, we find that scaling up the mid-training corpus consistently improves downstream RL performance. Based on these insights, we propose a two-stage mid-training strategy, Stable-then-Decay, and use it to build OctoThinker, a family of models that exhibits strong RL compatibility. We also release MegaMath-Web-Pro-Max, a math corpus of over 70 billion tokens.
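As a rough illustration of the Stable-then-Decay idea, the sketch below shows one way a two-phase learning-rate schedule could be written in Python: a long stable phase at a constant rate followed by a decay phase. The phase lengths, rates, and cosine decay shape are illustrative assumptions, not the hyperparameters used for OctoThinker.

```python
import math

def stable_then_decay_lr(step: int,
                         peak_lr: float = 3e-4,       # assumed constant rate for the stable phase
                         min_lr: float = 3e-5,        # assumed floor reached at the end of decay
                         stable_steps: int = 100_000, # illustrative length of the stable phase
                         decay_steps: int = 10_000    # illustrative length of the decay phase
                         ) -> float:
    """Two-phase schedule: hold a constant LR, then decay it.

    This mirrors the Stable-then-Decay idea only at a high level; the
    specific values and the cosine shape are placeholders.
    """
    if step < stable_steps:
        # Stage 1: stable phase at a constant learning rate.
        return peak_lr
    # Stage 2: cosine decay from peak_lr down to min_lr.
    progress = min((step - stable_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at a few points across both phases.
    for s in (0, 50_000, 100_000, 105_000, 110_000):
        print(s, f"{stable_then_decay_lr(s):.2e}")
```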