While large reasoning models have demonstrated a remarkable ability to generate long chains of thought (CoTs) in English, our understanding of how this long-CoT reasoning ability transfers to the majority of the world's languages remains limited. In this study, we systematically examine four key stages of model development (scaling, pretraining, post-training, and inference) to understand how long-CoT capabilities extend beyond English. We compare two inference settings across nine non-English target languages: En-CoT (where the model processes target-language input but reasons in English) and Target-CoT (where the model both processes the input and generates the long CoT in the target language). Scaling up model size improves multilingual task performance under En-CoT, but Target-CoT performance lags behind; this gap widens further on tasks requiring long, multi-step CoTs, such as mathematical reasoning. Turning to pretraining, adding a reasoning-specialized pretraining stage improves En-CoT performance but degrades Target-CoT, whereas extensive multilingual pretraining improves both modes simultaneously. Because high-quality reasoning traces are scarce in languages other than English, we explore a synthetic data curation approach for post-training: fine-tuning on traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from a large reasoning model. Finally, we report discrepancies in reasoning efficiency across languages and identify language-specific failure modes in CoT generation. We release our models, datasets, and code publicly for further research.
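To make the two inference settings concrete, the following is a minimal sketch of how En-CoT and Target-CoT prompts might be constructed; the function name, prompt wording, and the `generate()` call in the usage comment are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of the two CoT inference settings described above.
# Prompt wording and language handling are assumptions for demonstration only.

def build_prompt(question: str, target_lang: str, mode: str) -> str:
    """Build a prompt for a reasoning model under one of two CoT settings."""
    if mode == "en_cot":
        # En-CoT: the question stays in the target language,
        # but the model is instructed to reason in English.
        instruction = "Think step by step in English, then give the final answer."
    elif mode == "target_cot":
        # Target-CoT: the model is asked to produce its long chain of thought
        # in the same language as the question.
        instruction = (
            f"Think step by step in the same language as the question "
            f"({target_lang}), then give the final answer."
        )
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{instruction}\n\nQuestion: {question}"


# Example usage with a hypothetical generate() wrapper standing in for any
# chat/completions API:
# prompt_en  = build_prompt("2 + 2 は何ですか?", "Japanese", "en_cot")
# prompt_tgt = build_prompt("2 + 2 は何ですか?", "Japanese", "target_cot")
# answers = generate(prompt_en), generate(prompt_tgt)
```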