Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please be sure to credit the source when sharing.

From Spikes to Heavy Tails: Unveiling the Spectral Evolution of Neural Networks

Created by
  • Haebom

Author

Vignesh Kothapalli, Tianyu Pang, Shenyang Deng, Zongmin Liu, Yaoqing Yang

Outline

This paper addresses the tendency of modern deep neural networks (DNNs) to develop heavy-tailed (HT) empirical spectral densities (ESDs) in their layer weight matrices. While previous studies have shown that the HT phenomenon correlates with good generalization in large-scale NNs, a theoretical explanation for its emergence has been lacking; in particular, understanding the conditions that trigger it could help elucidate the interplay between generalization and the spectral density of the weights. This study aims to fill that gap by presenting a simple yet rich setting for modeling the emergence of HT ESDs: a two-layer NN in which heavy tails in the ESD are "created" during training, analyzed systematically and without any gradient noise. It is the first work to analyze this noise-free setting, and it incorporates optimizer-dependent (GD/Adam) large learning rates into the analysis of HT ESDs. The results highlight the role of learning rates in the early stages of training in producing the Bulk+Spike and HT shapes of the ESD, which can promote generalization in two-layer NNs. Although obtained in a much simpler setup, these observations offer insight into the behavior of large-scale NNs.
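The central object here is the ESD of a layer's weight matrix W, i.e., the eigenvalue distribution of WᵀW / n. The sketch below is a minimal stand-in for the paper's setting, not its exact construction: it trains a two-layer network with full-batch gradient descent (so there is no gradient noise) at a small and a large learning rate, then compares the resulting first-layer ESDs. All dimensions, the random-teacher targets, the step sizes, and the helper names (`esd`, `train_two_layer`) are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn


def esd(weight: torch.Tensor) -> np.ndarray:
    """ESD of a weight matrix W of shape (m, n): eigenvalues of W^T W / n."""
    W = weight.detach().numpy()
    n = W.shape[1]
    return np.linalg.eigvalsh(W.T @ W / n)


def train_two_layer(lr: float, steps: int = 500, d: int = 256, h: int = 256,
                    n_samples: int = 512, seed: int = 0) -> np.ndarray:
    """Full-batch GD on a two-layer net (noise-free); returns the ESD of the
    first-layer weights after training. All sizes are illustrative choices."""
    torch.manual_seed(seed)
    X = torch.randn(n_samples, d)
    # Hypothetical noise-free targets from a random linear teacher.
    y = X @ torch.randn(d, 1) / d ** 0.5
    model = nn.Sequential(nn.Linear(d, h, bias=False), nn.Tanh(),
                          nn.Linear(h, 1, bias=False))
    # SGD on the full batch is plain gradient descent: no sampling noise.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return esd(model[0].weight)


# A small step size typically leaves the bulk close to its shape at
# initialization, possibly with a few outlier spikes ("Bulk+Spike"); a large
# step size can stretch the spectrum toward a heavy tail.
for lr in (0.01, 0.5):
    ev = train_two_layer(lr)
    if not np.isfinite(ev).all():
        print(f"lr={lr}: training diverged; try a smaller step size")
        continue
    tail_mass = np.mean(ev > 2 * np.median(ev))
    print(f"lr={lr}: max eigenvalue={ev.max():.3f}, "
          f"fraction of eigenvalues above 2x median={tail_mass:.3f}")
```

With the larger step size, the top eigenvalues tend to separate further from the bulk; fitting a tail exponent to the largest eigenvalues (e.g., with a Hill estimator) would be one way to quantify heavy-tailedness beyond the crude statistics printed above.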

Takeaways, Limitations

Takeaways:
  • We provide a theoretical understanding of the emergence of heavy-tailed ESDs in two-layer NNs.
  • We analyze, for the first time, the emergence of HT ESDs in a noise-free setting.
  • We reveal the effect of the learning rate on ESD shape and generalization.
  • We offer insight into the behavior of large-scale NNs.
Limitations:
  • The analysis is limited to two-layer NNs.
  • It may not fully capture the complexity of real-world large-scale NNs.
  • Generalizability to other training strategies or network structures may be limited.