Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Created by
  • Haebom

Authors

Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji

Outline

This paper explores how large language models (LLMs) can acquire new language abilities and adapt to new domains through continual pretraining (CPT). Specifically, the authors systematically analyze how the optimal selection of key hyperparameters, such as the mixture ratio of additional language or domain corpora, affects model performance. They perform CPT on the Llama-3 8B and 70B models to improve Chinese proficiency, and study the optimal correlation between the additional language mixture ratio (ALMR) and the learning rate (LR) on the 8B model to derive the best experimental settings. Through careful selection and fine-tuning of these hyperparameters, they improve model performance not only on Chinese-related benchmarks but also in specific domains such as mathematics, coding, and emotional intelligence. The final 70B model is deployed in a real-world chat system, achieving satisfactory performance.
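The core idea behind ALMR is easiest to see as a data-sampling decision: during CPT, each training example is drawn from the original corpus or from the additional-language corpus according to a fixed ratio. The sketch below illustrates this under stated assumptions; the function name, the 0.3 ratio, and the toy corpora are illustrative and are not the paper's actual implementation, which tunes ALMR jointly with the learning rate on tokenized pretraining data.

```python
import random
from typing import Iterator


def mix_corpora(primary: Iterator[str], additional: Iterator[str],
                almr: float, seed: int = 0) -> Iterator[str]:
    """Yield training documents, drawing from the additional-language corpus
    with probability `almr` and from the primary corpus otherwise.

    `almr` plays the role of the additional language mixture ratio; this is
    a minimal sketch, not the paper's data pipeline.
    """
    rng = random.Random(seed)
    while True:
        source = additional if rng.random() < almr else primary
        try:
            yield next(source)
        except StopIteration:
            # Stop once either corpus is exhausted (a real pipeline would
            # typically reshuffle or cycle shards instead).
            return


# Illustrative usage with toy corpora; real CPT would stream tokenized shards.
english_docs = iter(["english doc 1", "english doc 2", "english doc 3"])
chinese_docs = iter(["chinese doc 1", "chinese doc 2"])
for doc in mix_corpora(english_docs, chinese_docs, almr=0.3):
    print(doc)
```

In this framing, choosing ALMR amounts to choosing the sampling probability above, and the paper's contribution is identifying how that choice interacts with the learning rate.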

Takeaways, Limitations

Takeaways:
Presents an experimental setup that improves the efficiency of CPT by analyzing the optimal correlation between the additional language mixture ratio (ALMR) and the learning rate (LR).
Experimentally verifies the Llama-3 model's improvements in Chinese proficiency and in various domains, including mathematics, coding, and emotional intelligence.
Demonstrates practicality by successfully deploying the 70B model in a real chat system.
Limitations:
The study is limited to Llama-3, so the findings may not generalize to other LLMs.
Further research is needed to determine whether the optimal correlation between ALMR and LR derived from the 8B model applies equally to models of other sizes, such as the 70B model.
The real chat-system deployment is reported without specific performance metrics or detailed analysis of the results.