The Muon optimizer consistently trains large language models (LLMs) faster than Adam, but the underlying mechanism has remained unclear. This paper elucidates that mechanism from an associative memory perspective. By ablating which transformer components Muon optimizes, we show that the LLM's associative memory parameters, namely the Value and Output (VO) attention weights and the feed-forward networks (FFNs), are the primary contributors to Muon's advantage. Building on this associative memory perspective, we attribute Muon's superiority on real-world data with heavy-tailed features to two key properties: (i) Muon consistently produces more isotropic singular spectra than Adam, and (ii) it optimizes tail classes more effectively than Adam on heavy-tailed data. We further validate these findings theoretically by analyzing a single-layer associative memory model under class-imbalanced data: Muon consistently achieves balanced learning across classes regardless of the feature embeddings, whereas Adam can induce large disparities in learning error depending on the properties of the embeddings. In conclusion, our empirical observations and theoretical analysis show that Muon's core advantage is an update rule that aligns with the structure of linear associative memory, enabling more balanced and effective learning of tail classes in long-tailed distributions than Adam.
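
To make property (i) concrete, the following is a minimal Python sketch, not the paper's code, contrasting an idealized Muon step, which orthogonalizes the momentum matrix via an exact SVD (practical implementations approximate this with Newton-Schulz iterations), with an Adam-style elementwise-normalized step on a synthetic heavy-tailed gradient; the function names, matrix sizes, and one-shot moment estimates are illustrative assumptions.

```python
import numpy as np

def muon_step(m, lr=0.02):
    # Idealized Muon update: replace the momentum matrix with its nearest
    # semi-orthogonal matrix (U V^T from the SVD), so every retained
    # singular direction receives the same scale (isotropic spectrum).
    # Practical Muon approximates this step with Newton-Schulz iterations.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return -lr * (u @ vt)

def adam_like_step(m, v, lr=0.02, eps=1e-8):
    # Adam-style update: elementwise normalization by a second-moment
    # estimate; the singular spectrum of the update is not equalized.
    return -lr * m / (np.sqrt(v) + eps)

rng = np.random.default_rng(0)
# Synthetic gradients whose columns span three orders of magnitude,
# a crude stand-in for head vs. tail class frequencies.
scales = np.diag(np.logspace(0, -3, 64))
grads = [rng.standard_normal((64, 64)) @ scales for _ in range(10)]
m = np.mean(grads, axis=0)                    # first-moment estimate
v = np.mean([g ** 2 for g in grads], axis=0)  # second-moment estimate

for name, upd in [("Muon", muon_step(m)), ("Adam", adam_like_step(m, v))]:
    s = np.linalg.svd(upd, compute_uv=False)
    print(f"{name}: max/min singular value of update = {s[0] / s[-1]:.2e}")
```

In this sketch the Muon update has all of its nonzero singular values equal (max/min ratio of 1), whereas the Adam-style update generally retains a skewed spectrum, mirroring the isotropy argument summarized above.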