Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Long Chain-of-Thought Reasoning Across Languages

Created by
  • Haebom

Author

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

Outline

While large reasoning models have demonstrated a remarkable ability to generate long chains of thought (CoTs) in English, our understanding of how this long-form reasoning ability transfers to the majority of the world's languages remains limited. This study systematically examines four key stages of model development (scale, pretraining, post-training, and inference) to understand how long-CoT capability extends beyond English. We compare two inference settings across nine non-English target languages: En-CoT, where the model receives input in the target language but reasons in English, and Target-CoT, where the model both receives input and generates its long CoT in the target language. Increasing model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind, and this gap widens further on tasks requiring long, multi-step CoTs, such as mathematical reasoning. Turning to pretraining, adding a specialized reasoning stage improves En-CoT performance but degrades Target-CoT, whereas extensive multilingual pretraining improves both modes simultaneously. Because high-quality reasoning traces are scarce in languages other than English, we explore synthetic data curation for post-training and show that fine-tuning on traces machine-translated from gold English traces outperforms fine-tuning on target-language traces distilled from a large reasoning model. Finally, we report discrepancies in reasoning efficiency across languages and identify language-specific failure modes in CoT. We publicly release our models, datasets, and code to support further research.
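The two evaluation settings can be pictured as two prompt templates that differ only in which language the chain of thought is requested in. The sketch below is a minimal illustration of that idea; the prompt wording, the build_prompt helper, and the example question are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch of the En-CoT vs. Target-CoT settings compared in the paper.
# The prompt wording and helper function are illustrative assumptions,
# not the authors' exact implementation.

def build_prompt(question: str, target_lang: str, setting: str) -> str:
    """Build a prompt for either En-CoT or Target-CoT evaluation."""
    if setting == "en_cot":
        # Input stays in the target language, but the model is asked
        # to carry out its chain of thought in English.
        instruction = (
            "Read the following problem, reason step by step in English, "
            "then give the final answer."
        )
    elif setting == "target_cot":
        # Both the input and the generated chain of thought are in the
        # target language.
        instruction = (
            f"Read the following problem, reason step by step in {target_lang}, "
            f"then give the final answer in {target_lang}."
        )
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{instruction}\n\nProblem ({target_lang}):\n{question}"


# Example: the same Swahili question rendered under both settings.
question_sw = "Juma ana tufaha 12 na anampa rafiki yake 5. Amebakiwa na tufaha ngapi?"
for setting in ("en_cot", "target_cot"):
    print(build_prompt(question_sw, "Swahili", setting))
    print("---")
```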

Takeaways, Limitations

Increasing model size improves En-CoT (reasoning in English) performance, but Target-CoT (reasoning in the target language) performance lags behind.
The gap between En-CoT and Target-CoT widens on complex tasks such as mathematical reasoning.
Adding a specialized reasoning stage helps En-CoT but hurts Target-CoT.
Extensive multilingual pretraining benefits both En-CoT and Target-CoT.
Fine-tuning on reasoning traces machine-translated from English is more effective than fine-tuning directly on target-language traces; a sketch of this curation step follows the list.
Reasoning efficiency and CoT failure modes differ across languages.
The scarcity of high-quality reasoning traces in non-English languages is a limitation.
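The translated-trace curation mentioned above can be sketched as follows. This is a rough illustration under stated assumptions: the translate callable stands in for whatever machine-translation system is used, and the record schema is invented for the example rather than taken from the released dataset.

```python
# Sketch of building post-training data by machine-translating gold English
# reasoning traces into a target language. The `translate` callable is a
# placeholder for an MT system; the record schema is an illustrative assumption.

from typing import Callable

def curate_translated_traces(
    english_examples: list[dict],          # each: {"question", "trace", "answer"}
    target_lang: str,
    translate: Callable[[str, str], str],  # (text, target_lang) -> translated text
) -> list[dict]:
    """Turn gold English (question, trace, answer) triples into
    target-language fine-tuning examples."""
    curated = []
    for ex in english_examples:
        curated.append(
            {
                "question": translate(ex["question"], target_lang),
                "trace": translate(ex["trace"], target_lang),
                "answer": translate(ex["answer"], target_lang),
                "source": "mt_from_gold_english",
            }
        )
    return curated

# Usage with a dummy identity "translator", just to show the shape of the data.
dummy_translate = lambda text, lang: text
examples = [{"question": "2+2=?", "trace": "Add 2 and 2 to get 4.", "answer": "4"}]
print(curate_translated_traces(examples, "sw", dummy_translate))
```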