This paper proposes a mathematical framework to characterize how, under what conditions, and for which features grokking, a phenomenon of delayed generalization, occurs on complex inputs. The framework, $\mathbf{Li_2}$, captures the grokking dynamics of two-layer nonlinear networks through three stages: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. The analysis elucidates the effect of key hyperparameters, such as weight decay, learning rate, and sample size, on grokking; yields verifiable scaling laws for memorization and generalization; and explains the underlying principles that make Muon-like optimizers effective.
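To make the setting concrete, below is a minimal sketch, in a standard PyTorch formulation, of the kind of two-layer nonlinear network and weight-decay training regime in which grokking is typically studied. The synthetic data, dimensions, optimizer choice, and hyperparameter values are illustrative assumptions and are not taken from the paper's experiments.

```python
# Minimal sketch (not the paper's code) of a two-layer nonlinear network
# trained with weight decay, the regime in which delayed generalization
# (grokking) is commonly observed. All values below are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, hidden, n_classes, n_train = 64, 256, 16, 512  # assumed toy sizes

# Two-layer nonlinear network: one hidden nonlinearity, linear readout.
model = nn.Sequential(
    nn.Linear(d_in, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_classes),
)

# Toy synthetic classification data stand in for the structured inputs
# analyzed in the paper; a small sample size and nonzero weight decay are
# the hyperparameters the framework relates to grokking.
x_train = torch.randn(n_train, d_in)
y_train = torch.randint(0, n_classes, (n_train,))

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(10_000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:5d}  train loss {loss.item():.4f}")
```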