Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Created by
  • Haebom

Author

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang

Outline

This paper addresses data quality, a critical factor in improving the performance of large language models. To overcome the limitation of existing model-based data selection methods, which focus only on English, it presents MuRating, a scalable framework that transfers English quality signals to 17 target languages through a single rater. MuRating first learns a unified document-quality score by aggregating pairwise comparisons from multiple English raters, then projects these judgments onto translated texts to train a multilingual rater. Applying this rater to web data, the authors select a balanced subset of English and multilingual content and use it to pretrain a 1.2-billion-parameter LLaMA model. Compared to existing methods such as QuRater, AskLLM, and DCLM, MuRating improves accuracy on both English and multilingual benchmarks, with especially strong gains on knowledge-intensive tasks. The paper also analyzes translation fidelity, selection bias, and the underrepresentation of narrative data, and suggests directions for future research.
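The unified-score step described above aggregates pairwise quality judgments from multiple raters into a single per-document score. The sketch below illustrates one standard way to do this, a Bradley-Terry model fit with simple MM updates; the function name, the Bradley-Terry choice, and the toy data are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumption): fitting Bradley-Terry quality scores
# from pairwise "document A beats document B" judgments, as one possible
# way to derive a unified document-quality score from multiple raters.
from collections import defaultdict

def bradley_terry_scores(comparisons, n_iters=200):
    """comparisons: list of (winner, loser) document-id pairs.
    Returns a dict mapping document id -> score (higher = better quality)."""
    wins = defaultdict(float)          # total wins per document
    pair_counts = defaultdict(float)   # comparison count per unordered pair
    docs = set()
    for w, l in comparisons:
        wins[w] += 1.0
        pair_counts[frozenset((w, l))] += 1.0
        docs.update((w, l))
    scores = {d: 1.0 for d in docs}
    for _ in range(n_iters):           # MM updates (Hunter, 2004)
        new = {}
        for d in docs:
            denom = 0.0
            for pair, n in pair_counts.items():
                if d in pair:
                    other = next(x for x in pair if x != d)
                    denom += n / (scores[d] + scores[other])
            new[d] = wins[d] / denom if denom > 0 else scores[d]
        total = sum(new.values())      # normalize so scores stay comparable
        scores = {d: s * len(docs) / total for d, s in new.items()}
    return scores
```

With scores in hand, data selection reduces to keeping the top-ranked documents per language; the quota per language (the "balanced subset" in the summary) is a separate design decision.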

Takeaways, Limitations

Takeaways:
  • Presents MuRating, an effective framework for multilingual data quality assessment that overcomes the limitations of existing English-centric methods.
  • Contributes to high-quality data selection and improved performance in multilingual LLM pretraining.
  • Shows particularly large performance gains on knowledge-intensive tasks.
Limitations:
  • Translation fidelity, selection bias, and the underrepresentation of narrative data remain open issues requiring future research.
  • The scalability and generalizability of MuRating need further validation.
  • A more in-depth analysis is needed of how the quality of the translation model affects the results.