Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Created by
  • Haebom

Authors

Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun

Outline

This paper demonstrates the utility of multi-way parallel data for improving the performance of multilingual large language models (LLMs), including on low-resource languages. The authors highlight the limitations of existing pre-training and instruction-tuning approaches that rely on unaligned multilingual data, and introduce TED2025, a large-scale, high-quality multi-way parallel corpus built from TED Talks, covering 113 languages with up to 50 languages aligned in parallel. Using TED2025, they explore strategies for leveraging multi-way parallel data, including continued pre-training and instruction tuning, and experimentally show that models trained on multi-way parallel data outperform models trained on unaligned data across six multilingual benchmarks.
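For illustration only, here is a minimal sketch of how a single multi-way parallel record (one sentence aligned across several languages) might be turned into training examples for continued pre-training and instruction tuning. The record format, language tags, prompt template, and function names are assumptions made for this sketch, not the paper's actual data pipeline.

```python
# Hypothetical sketch: converting one multi-way parallel record into
# training examples. Assumes a record shaped as {lang_code: sentence};
# all names and templates below are illustrative, not from the paper.

import random

def make_pretraining_example(record: dict[str, str], n_langs: int = 4) -> str:
    """Concatenate aligned translations of one sentence so continued
    pre-training exposes the model to the same content in several languages."""
    langs = random.sample(sorted(record), k=min(n_langs, len(record)))
    return "\n".join(f"<{lang}> {record[lang]}" for lang in langs)

def make_instruction_example(record: dict[str, str], src: str, tgt: str) -> dict[str, str]:
    """Build a simple translation-style instruction-tuning pair from two
    aligned languages in the same record."""
    return {
        "instruction": f"Translate the following {src} sentence into {tgt}.",
        "input": record[src],
        "output": record[tgt],
    }

if __name__ == "__main__":
    record = {
        "en": "Ideas are worth spreading.",
        "de": "Ideen sind es wert, verbreitet zu werden.",
        "fr": "Les idées méritent d'être diffusées.",
    }
    print(make_pretraining_example(record, n_langs=3))
    print(make_instruction_example(record, "en", "fr"))
```

Because every sentence is aligned across many languages, the same record can yield many source-target pairs, which is what distinguishes multi-way parallel data from ordinary bilingual corpora.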

Takeaways, Limitations

Takeaways:
Empirically demonstrates the effectiveness of pre-training and instruction-tuning strategies for LLMs that use multi-way parallel data.
Provides TED2025, a large-scale, high-quality multi-way parallel corpus that can help improve multilingual LLM performance, including for low-resource languages.
Identifies effective strategies for utilizing multi-way parallel data and analyzes the factors that influence their effectiveness.
Limitations:
Because the corpus is built from TED Talks, it may be biased toward certain domains of knowledge.
Generating and aligning multi-way parallel data is difficult and costly.
Further research is needed on how factors beyond those analyzed affect LLM performance.