Daily Arxiv

This page collects and summarizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper remains with its authors and their institutions; when sharing, please cite the source.

AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

Created by
  • Haebom

Author

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

AutoScale: A Scale-Aware Data Composition Framework for LLM Pre-training

Outline

This paper studies domain reweighting, which adjusts the relative weights of different data sources to improve the efficiency and effectiveness of LLM pre-training. In particular, it highlights that a data mixture that performs well in small-scale experiments may not retain its advantage at larger scales. To address this, the authors propose AutoScale, a two-stage, scale-aware data composition framework. AutoScale first fits a parametric model that predicts the model's loss under different data compositions and uses it to find the optimal allocation at a smaller compute budget. Then, drawing on a novel theoretical analysis of how the optimal composition evolves with scale, it extrapolates that composition to larger budgets without additional retraining. AutoScale accelerates convergence and improves downstream performance: when pre-training GPT-2 Large, it reduces perplexity 28% faster than existing methods and up to 38% faster than training without reweighting, while achieving the best average performance across a range of downstream tasks.
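To make the two-stage idea concrete, below is a minimal sketch, assuming a simple per-domain power-law loss model and treating the overall objective as the mean of per-domain losses; the paper's actual parametric predictor, optimizer, and extrapolation rule may differ, and all domain names and numbers here are hypothetical. Stage 1 fits the loss predictor from small-scale runs; Stage 2 solves for the loss-minimizing mixture at a small budget and again at a 10x larger one, illustrating how the optimal composition can shift with scale.

```python
# Illustrative sketch only (not the paper's exact method).
import numpy as np
from scipy.optimize import curve_fit, minimize

def domain_loss(n, c, b, alpha):
    """Assumed power-law form: predicted loss after training on n units of a domain's data."""
    return c + b * n ** (-alpha)

# Hypothetical small-scale observations per domain: (data units trained on, measured loss).
observations = {
    "web":   ([1.0, 3.0, 10.0], [4.1, 3.8, 3.6]),
    "code":  ([1.0, 3.0, 10.0], [3.2, 2.9, 2.7]),
    "books": ([1.0, 3.0, 10.0], [3.9, 3.7, 3.6]),
}

# Stage 1: fit the parametric loss predictor for each domain.
fits = {name: curve_fit(domain_loss, ns, losses, p0=[2.0, 1.5, 0.3], maxfev=20000)[0]
        for name, (ns, losses) in observations.items()}

def optimal_mixture(total_budget):
    """Stage 2: find domain weights minimizing the mean predicted loss at a given budget."""
    names = list(fits)
    def objective(weights):
        return float(np.mean([domain_loss(w * total_budget, *fits[name])
                              for w, name in zip(weights, names)]))
    k = len(names)
    res = minimize(objective, x0=np.full(k, 1.0 / k),
                   bounds=[(1e-6, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
    return dict(zip(names, res.x))

# Compare the optimal mixture at a small budget vs. a 10x larger one.
print("small budget:", optimal_mixture(10.0))
print("large budget:", optimal_mixture(100.0))
```

In this toy setup the mixture shifts toward domains whose predicted loss still drops quickly at the larger budget, which is the qualitative behavior AutoScale exploits when extrapolating compositions across scales.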

Takeaways, Limitations

Takeaways:
Highlights the importance of data composition in LLM pre-training and shows that the relative importance of data domains changes with scale.
Points out the limitation of directly transferring small-scale experimental results to large-scale training, and argues for data composition that accounts for scale.
Demonstrates faster convergence and improved downstream performance with the AutoScale framework.
Improves the accessibility of the research by releasing open-source code.
Limitations:
No limitations are explicitly discussed in the paper.