Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fantastic Pretraining Optimizers and Where to Find Them

Created by
  • Haebom

Author

Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang

Outline

This paper presents a systematic study of the speedup claims made for optimizers proposed as replacements for AdamW in large-scale language model pretraining. It identifies two problems that have skewed comparisons in prior work, unfair hyperparameter tuning and limited evaluation settings, and re-compares ten optimizers across four model sizes and a range of data-to-model ratios. The results show that rigorous per-optimizer hyperparameter tuning and end-of-training evaluation across model sizes and data-to-model ratios are essential for fair comparison. Under this protocol, the speedups claimed in previous studies turn out to be lower than reported and tend to shrink as model size grows. In particular, the fastest optimizers, such as Muon and Soap, all use matrix-based preconditioners rather than AdamW's elementwise one, yet their speedup over AdamW decreases roughly in inverse proportion to model size.
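To make the contrast concrete, below is a minimal NumPy sketch of the two preconditioning styles: AdamW rescales each gradient entry independently (a diagonal preconditioner), while a Muon-style update preconditions the whole gradient matrix at once by approximately orthogonalizing the momentum with a Newton-Schulz iteration. This is an illustrative simplification, not the paper's or the original optimizers' implementation; the function names, the simple cubic iteration, and the five-step count are assumptions made for the example.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: elementwise (diagonal) preconditioning via per-entry second moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def newton_schulz_orthogonalize(mat, steps=5):
    """Approximately push the matrix's singular values toward 1 (illustrative
    cubic Newton-Schulz; the real Muon uses a tuned higher-order polynomial)."""
    x = mat / (np.linalg.norm(mat) + 1e-7)   # keep singular values <= 1 so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_like_step(w, g, mom, lr=0.02, beta=0.95):
    """Muon-style: a matrix-based preconditioner acting on the whole weight matrix."""
    mom = beta * mom + g
    return w - lr * newton_schulz_orthogonalize(mom), mom
```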

Takeaways, Limitations

Takeaways:
The reliability of existing claims about optimizer speedups in large-scale language model pretraining is called into question.
We present rigorous per-optimizer hyperparameter tuning and end-of-training evaluation across model sizes and data-to-model ratios as the basis for fair comparison.
We find that the speedup of optimizers using matrix-based preconditioners decreases as model size grows.
We experimentally show that the speedup over AdamW becomes marginal as model size increases (see the sketch after the Limitations list for one way such a speedup can be quantified).
Limitations:
The optimizers, model sizes, and data-to-model ratios considered in this study may not cover all settings of interest.
Further research is needed on generalizability to other types of language models and tasks.
Even more precise comparisons may require exploring a wider hyperparameter space.
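As referenced in the Takeaways, one common way to quantify an optimizer's speedup over AdamW is the token ratio: how much less training data the candidate needs to reach the loss AdamW attains at its full budget. The sketch below assumes this definition (the paper's exact protocol may differ) and that both runs log validation loss at the same token checkpoints.

```python
import numpy as np

def tokens_to_reach(losses, tokens, target_loss):
    """Interpolate the token count at which a decreasing loss curve first
    reaches target_loss; returns np.inf if it never does."""
    losses, tokens = np.asarray(losses), np.asarray(tokens)
    below = np.nonzero(losses <= target_loss)[0]
    if below.size == 0:
        return np.inf
    i = below[0]
    if i == 0:
        return tokens[0]
    # linear interpolation between the two bracketing checkpoints
    t0, t1 = tokens[i - 1], tokens[i]
    l0, l1 = losses[i - 1], losses[i]
    return t0 + (t1 - t0) * (l0 - target_loss) / (l0 - l1)

def speedup_over_adamw(adamw_losses, candidate_losses, tokens):
    """Token-ratio speedup: values near 1.0 mean no real gain over AdamW."""
    target = adamw_losses[-1]
    return tokens[-1] / tokens_to_reach(candidate_losses, tokens, target)
```

A value of roughly 1.3 would mean the candidate matched AdamW's end-of-training loss with about 30% fewer tokens; the paper's finding is that such values shrink toward 1 as model size grows.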