Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

Created by
  • Haebom

Author

Yuandong Tian

Outline

We propose a novel framework, $\mathbf{Li_2}$, for the grokking phenomenon (delayed generalization). The framework captures the grokking behavior of two-layer nonlinear networks in three stages: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. $\mathbf{Li_2}$ characterizes how key hyperparameters such as weight decay, learning rate, and sample size affect grokking. It also derives provable scaling laws for feature emergence, memorization, and generalization, and explains why state-of-the-art optimizers such as Muon are effective.
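For readers who want to see the kind of setting the framework analyzes, below is a minimal sketch of a standard grokking experiment: a two-layer nonlinear network trained on modular addition with weight decay. The task, architecture, and hyperparameters here are illustrative assumptions, not values taken from the paper.

```python
# Minimal grokking sketch (illustrative; not the paper's exact setup).
import torch
import torch.nn as nn

P = 97  # modulus for the a + b (mod P) task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))  # sample size: one of the key hyperparameters
train_idx, test_idx = perm[:n_train], perm[n_train:]

def one_hot(batch):
    # Concatenate one-hot encodings of the two operands.
    return torch.cat([nn.functional.one_hot(batch[:, 0], P),
                      nn.functional.one_hot(batch[:, 1], P)], dim=1).float()

# Two-layer nonlinear network, as in the setting the framework studies.
model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(one_hot(pairs[train_idx])),
                                       labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            preds = model(one_hot(pairs[test_idx])).argmax(-1)
            test_acc = (preds == labels[test_idx]).float().mean().item()
        print(f"step {step:5d}  train loss {loss.item():.4f}  test acc {test_acc:.3f}")
```

In a run like this, training loss typically drops quickly (memorization) while test accuracy jumps only much later; this delayed generalization is the behavior the three stages above are meant to explain.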

Takeaways, Limitations

Takeaways:
  • Presents a new framework, $\mathbf{Li_2}$, that explains the grokking phenomenon.
  • Analyzes the grokking process of a two-layer network in three stages.
  • Characterizes how weight decay, learning rate, and sample size relate to grokking.
  • Presents provable scaling laws for feature emergence, memorization, and generalization.
  • Clarifies why optimizers such as Muon are effective (see the sketch after this list).
  • Extends to multi-layer architectures.
Limitations:
  • Specific limitations are not stated in the abstract. (Refer to the full text of the paper.)
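As context for the Muon takeaway above, the following is a minimal sketch of a Muon-style update, which orthogonalizes each weight matrix's momentum via a Newton-Schulz iteration. The coefficients and step sizes follow the commonly published Muon recipe and are assumptions here, not values from this paper.

```python
# Sketch of a Muon-style orthogonalized momentum update (assumed recipe, not from the paper).
import torch

def newton_schulz(G, steps=5):
    # Approximately maps G to the nearest semi-orthogonal matrix via a quintic iteration.
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly cited Muon coefficients
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T  # keep X wide so X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    # Accumulate momentum, orthogonalize it, then apply the update.
    momentum.mul_(beta).add_(grad)
    weight.add_(newton_schulz(momentum), alpha=-lr)
```

The design idea is that orthogonalizing the update equalizes the step taken along different singular directions of the momentum, which is one intuition for why such optimizers can speed up feature emergence.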