Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Long Chain-of-Thought Reasoning Across Languages

Created by
  • Haebom

Authors

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

Outline

This paper explores the multilingual extension of long chains of thought (CoTs), which underpin the improved reasoning performance of large language models (LLMs). The authors fine-tuned the Qwen 2.5 (7B) and Qwen 3 (8B) models on two English reasoning datasets translated into French, Japanese, Latvian, and Swahili. Experiments revealed that the effectiveness of using English as a bridge language varied across languages: it was ineffective for French, effective for Japanese and Latvian, and only weakly effective for Swahili. Furthermore, the extensive multilingual pretraining of Qwen 3 reduced, but did not completely eliminate, the performance gap between languages. Fine-tuning on a small dataset of just 1k traces improved Swahili performance by more than 30%. Finally, the trade-off between data quality and scale varied across languages: English and French benefited from smaller, more carefully curated datasets, while Swahili and Latvian benefited from larger, noisier corpora. These results clarify how and why long CoTs transfer across languages, and the translated datasets are released to support fair multilingual reasoning research.
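
To make the experimental setup concrete, below is a minimal sketch (not the authors' released code) of supervised fine-tuning on translated long-CoT traces with Hugging Face transformers. The model ID, data file path, field names, and hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: fine-tune a Qwen model on ~1k machine-translated long-CoT traces.
# Assumptions: a JSONL file with "question" and "trace" fields (hypothetical names),
# and the Qwen2.5-7B-Instruct checkpoint as a stand-in for the paper's models.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="bfloat16")

# Hypothetical path to reasoning traces translated into, e.g., Swahili.
data = load_dataset("json", data_files="traces_swahili.jsonl", split="train")

def to_features(example):
    # Concatenate the question and its long CoT trace; train with a plain
    # causal language-modeling objective over the full sequence.
    text = example["question"] + "\n" + example["trace"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen25-swahili-cot",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # mlm=False yields causal-LM labels (the model shifts them internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same recipe would be repeated per target language and per source dataset; the paper's quality-versus-scale findings then amount to varying the size and noisiness of the translated training file.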

Takeaways, Limitations

Takeaways:
The effectiveness of using English as a bridge language varies across languages.
We show the importance of multilingual pretraining and the effectiveness of fine-tuning on small datasets.
We show that the trade-off between data quality and scale varies across languages.
We provide translated datasets for multilingual reasoning research.
Limitations:
The number of languages used in the study is limited.
The results may be specific to the Qwen models evaluated.
Further research is needed on generalizability to other types of reasoning tasks.