Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Created by
  • Haebom

Author

Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun

Outline

This paper leverages multi-way parallel data to improve the performance of large language models (LLMs) in low-resource languages. The authors highlight the limitations of existing pre-training and instruction-tuning approaches that rely on unaligned multilingual data, and introduce TED2025, a large-scale, high-quality multi-way parallel corpus spanning 113 languages, built from TED Talks. Using TED2025, they study how strategies such as continued pre-training and instruction tuning can best exploit parallel data, and experimentally demonstrate that models trained on multi-way parallel data outperform models trained on unaligned multilingual data across six multilingual benchmarks.
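The page does not describe the paper's actual training pipeline, but the core idea of multi-way alignment can be illustrated with a small sketch: one record containing the same sentence in several languages can be expanded into directed translation-style instruction examples. The record contents, field names, and pairing scheme below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: expanding a multi-way parallel record into
# instruction-tuning examples. Data and field names are assumptions.
from itertools import permutations

# One multi-way parallel record: the same sentence in several languages.
record = {
    "en": "Ideas are worth spreading.",
    "de": "Ideen sind es wert, verbreitet zu werden.",
    "sw": "Mawazo yanafaa kusambazwa.",
}

def to_instruction_pairs(parallel_record):
    """Expand one aligned record into directed translation instructions."""
    examples = []
    for src, tgt in permutations(parallel_record.keys(), 2):
        examples.append({
            "instruction": f"Translate the following {src} sentence into {tgt}.",
            "input": parallel_record[src],
            "output": parallel_record[tgt],
        })
    return examples

for ex in to_instruction_pairs(record):
    print(ex["instruction"], "->", ex["output"])
```

Because every record is aligned across all of its languages, n languages yield n(n-1) directed pairs, which is what distinguishes multi-way parallel data from merely unaligned multilingual text.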

Takeaways, Limitations

Takeaways:
  • Experimentally demonstrates the effectiveness of LLM pre-training and fine-tuning with multi-way parallel data.
  • Presents TED2025, a large-scale, high-quality multi-way parallel corpus covering 113 languages.
  • Presents optimal strategies for utilizing multi-way parallel data.
  • Contributes to improving LLM performance in low-resource languages.
Limitations:
  • Because the corpus is built from TED Talks, further research on generalizability to other domains is needed.
  • Lacks comparative analysis with other types of multilingual data.
  • Lacks discussion of the cost and resources required to build and use multi-way parallel data.