Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Muon Outperforms Adam in Tail-End Associative Memory Learning

Created by
  • Haebom

Author

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent YF Tan

Outline

The Muon optimizer consistently trains large-scale language models (LLMs) faster than Adam, but the mechanism behind this advantage has remained unclear. This paper explains the mechanism from an associative memory perspective. By ablating the transformer components that Muon optimizes, the authors show that the LLM's associative memory parameters, namely the Value and Output (VO) attention weights and the feed-forward networks (FFNs), are the primary contributors to Muon's superiority.

Building on this associative memory view, the paper attributes Muon's advantage on real-world data with heavy-tailed features to two key properties: (i) Muon consistently produces more isotropic singular spectra than Adam, and (ii) it optimizes tail classes more effectively than Adam on heavy-tailed data. These findings are then validated theoretically by analyzing a single-layer associative memory model under class-imbalanced data: Muon consistently achieves balanced learning across classes regardless of the feature embeddings, whereas Adam can induce large imbalances in learning error depending on the properties of the embeddings. In conclusion, the empirical observations and theoretical analysis show that Muon's core advantage is its update rule: because it aligns with the outer-product structure of linear associative memory, it learns tail classes in long-tailed distributions more evenly and effectively than Adam.
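
To make the update-rule point concrete, here is a minimal NumPy sketch (not the authors' code) of the orthogonalized matrix update that Muon applies: the momentum matrix is replaced by its nearest semi-orthogonal factor U Vᵀ, so every retained singular direction is applied with equal magnitude, which is the source of the isotropic singular spectrum mentioned above. The real Muon implementation uses a Newton-Schulz iteration rather than an explicit SVD, and the dimensions and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def muon_like_update(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step on a matrix parameter (illustrative sketch)."""
    momentum = beta * momentum + grad              # momentum accumulation
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    update = u @ vt                                # nearest semi-orthogonal matrix: all singular values = 1
    return weight - lr * update, momentum

# Compare the singular spectrum of a raw gradient-like matrix with its orthogonalized version.
rng = np.random.default_rng(0)
g = rng.standard_normal((64, 32)) * np.linspace(5.0, 0.1, 32)   # columns with very uneven scales
u, s, vt = np.linalg.svd(g, full_matrices=False)
print("raw singular values    :", np.round(s[:5], 2), "...")
print("after orthogonalization:", np.round(np.linalg.svd(u @ vt, compute_uv=False)[:5], 2), "...")
```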

Takeaways, Limitations

The Muon optimizer is faster and more effective than Adam for LLM training.
Muon's performance gains are tied to the LLM's associative memory parameters, i.e., the VO attention weights and FFNs.
Muon learns tail classes more effectively on long-tailed data.
Muon produces a more isotropic singular spectrum than Adam.
Theoretical analysis of a single-layer associative memory demonstrates Muon's balanced learning on class-imbalanced data (see the sketch after this list).
(Limitations are not explicitly discussed in the paper.)
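
As referenced in the list above, the class-imbalance comparison can be illustrated with a toy version of the single-layer associative memory setting. The sketch below is not the paper's experimental setup; it assumes a squared loss, Zipf-like class frequencies, an SVD in place of Muon's Newton-Schulz step, and small dimensions chosen for illustration. It trains the same linear memory with Adam and with a Muon-like orthogonalized update and prints the resulting per-class error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, steps, lr = 32, 10, 300, 0.05
E = rng.standard_normal((K, d)) / np.sqrt(d)       # one feature embedding per class
Y = np.eye(K)                                      # one-hot targets: the memory should map E[k] -> Y[k]
p = 1.0 / np.arange(1, K + 1); p /= p.sum()        # Zipf-like heavy-tailed class frequencies

def per_class_error(W):
    # squared error of the linear memory W on each class embedding
    return np.sum((E @ W.T - Y) ** 2, axis=1)

def grad(W):
    # gradient of the frequency-weighted loss  sum_k p_k ||W e_k - y_k||^2
    return 2 * ((E @ W.T - Y) * p[:, None]).T @ E

# Adam on the full-batch gradient
W_adam, m, v = np.zeros((K, d)), np.zeros((K, d)), np.zeros((K, d))
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, steps + 1):
    g = grad(W_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    W_adam -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

# Muon-like update: momentum orthogonalized with an SVD (stand-in for Newton-Schulz)
W_muon, mom = np.zeros((K, d)), np.zeros((K, d))
for t in range(steps):
    mom = 0.95 * mom + grad(W_muon)
    u, _, vt = np.linalg.svd(mom, full_matrices=False)
    W_muon -= lr * (u @ vt)

print("per-class error, Adam:", np.round(per_class_error(W_adam), 3))
print("per-class error, Muon:", np.round(per_class_error(W_muon), 3))
```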