Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Conda: Column-Normalized Adam for Training Large Language Models Faster

Created by
  • Haebom

Authors

Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin

Summary of the Column-Normalized Adam (Conda) Paper

Outline

Conda is a new optimizer designed to improve the training efficiency of large language models (LLMs). It combines Adam's fast convergence with Muon's spectral regularization, mitigating the spectral instability of Adam's updates while retaining coordinate-wise adaptivity. Concretely, Conda projects updates into an orthogonal space and applies column-wise second-moment normalization based on the projected gradients. In experiments on the LLaMA and GPT-2 series, Conda consistently outperformed optimizers such as AdamW and Muon, converging 2-2.5 times faster than AdamW on the LLaMA series.
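To make the recipe above more concrete, here is a minimal, hypothetical PyTorch sketch of one Conda-style update step for a single 2-D weight matrix: it keeps an Adam-like first moment, derives an orthogonal basis for the projection (via SVD here, which is an assumption; the paper's projection may differ), and normalizes the projected update column by column with a second-moment estimate. The function name, hyperparameters, and projection choice are illustrative and are not the authors' reference implementation.

```python
# Hypothetical sketch of a Conda-style update, based only on the summary above.
# Not the authors' implementation; the SVD-based projection is an assumption.
import torch

def conda_style_step(W, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-style first moment over the raw gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad

    # Orthogonal basis for the projection (here taken from an SVD of the momentum).
    U, _, _ = torch.linalg.svd(state["m"], full_matrices=False)

    # Project the momentum and the gradient into the orthogonal space.
    m_proj = U.T @ state["m"]
    g_proj = U.T @ grad

    # Column-wise second moment of the projected gradient (one scalar per column).
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_proj.pow(2).mean(dim=0)

    # Normalize each column of the projected update, map back, and apply.
    update = U @ (m_proj / (state["v"].sqrt() + eps))
    W -= lr * update
    return W, state

# Usage: one step on a toy 64x32 weight matrix.
W = torch.randn(64, 32)
grad = torch.randn_like(W)
state = {"m": torch.zeros_like(W), "v": torch.zeros(W.shape[1])}
W, state = conda_style_step(W, grad, state)
```

The column-wise statistics keep one second-moment scalar per column rather than one per coordinate, which is what distinguishes this kind of normalization from Adam's fully element-wise scaling.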

Takeaways, Limitations

Takeaways:
Conda is an effective optimizer that significantly improves the convergence speed of LLM training.
It outperforms widely used baseline optimizers such as AdamW and Muon.
It demonstrates robust performance in a variety of training environments.
Limitations:
The paper's limitations are not explicitly stated.