Conda: Column-Normalized Adam for Training Large Language Models Faster
Created by
Haebom
Author
Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
Summary of the Column-Normalized Adam (Conda) Paper
Outline
Conda is a new optimizer designed to improve the training efficiency of large language models (LLMs). It combines Adam's fast convergence with Muon's spectral regularization, mitigating Adam's spectral instability while preserving coordinate-wise adaptivity. Concretely, Conda projects the update onto an orthogonal subspace and applies column-wise second-moment normalization based on the projected gradients. In experiments on the LLaMA and GPT-2 series, Conda consistently outperformed optimizers such as AdamW and Muon, converging 2-2.5 times faster than AdamW on the LLaMA series.
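To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a Conda-style step as described in the summary: the momentum of the gradient is projected onto an orthonormal basis (obtained here via a reduced QR factorization, one possible choice), a single second-moment estimate is kept per column of the projected gradient, and the column-normalized update is mapped back to weight space. The function name, variable names, and the QR-based projection are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a Conda-style update, assuming the mechanism
# described in the summary above (not the authors' code).
import torch

def conda_style_step(W, grad, m, v_col, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical update for a 2-D weight matrix W (rows x cols)."""
    # First-moment (momentum) estimate, as in Adam.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)

    # Project the momentum onto an orthonormal basis (reduced QR is one
    # simple way to obtain an orthogonal subspace).
    Q, _ = torch.linalg.qr(m, mode="reduced")   # Q: (rows, k), orthonormal columns
    proj = Q.T @ m                              # projected update: (k, cols)

    # Column-wise second moment: one scalar per column of the projected gradient,
    # instead of Adam's per-element second moment.
    v_col.mul_(beta2).add_(proj.pow(2).mean(dim=0), alpha=1 - beta2)

    # Normalize each column and map the update back to the original space.
    update = Q @ (proj / (v_col.sqrt() + eps))
    W.add_(update, alpha=-lr)
    return W, m, v_col

# Usage with toy shapes, to show the expected tensor bookkeeping.
W = torch.randn(64, 32)
grad = torch.randn(64, 32)
m = torch.zeros_like(W)
v_col = torch.zeros(32)       # one second-moment entry per column
W, m, v_col = conda_style_step(W, grad, m, v_col)
```

The key point the sketch illustrates is that the second-moment statistics are shared across each column of the projected gradient, which is what distinguishes the column-normalized update from Adam's fully element-wise scaling.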
Takeaways and Limitations
• Takeaways:
◦ Conda is an effective optimizer that significantly improves the convergence speed of LLM training.
◦ It outperforms baseline optimizers such as AdamW and Muon.
◦ It demonstrates robust performance across a variety of training settings.