This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Created by
Haebom
Author
Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Milind Naphade, Sambit Sahu, Irina Rish, Ekaterina Lobacheva
Outline
This paper aims to understand how scaling improves language models through their training dynamics, focusing in particular on loss deceleration: an abrupt slowdown in the rate of loss improvement early in training, visible as piecewise linear behavior of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs and (2) improving the rate of log-log loss improvement after deceleration. The authors attribute loss deceleration to a kind of degenerate training dynamics called zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, causing destructive interference in per-example loss changes: improving the loss on one set of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws and could be targeted directly to improve language models independent of scale. Code and results are available at https://github.com/mirandrom/zsl .
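As a rough illustration of the two quantities discussed above, the sketch below shows one way to (a) locate the deceleration point by fitting a two-segment line to the loss curve in log-log space and (b) measure how much per-example loss movement cancels out across examples. This is a minimal sketch, not the authors' code: the function names, the brute-force breakpoint search, and the exact interference metric are illustrative assumptions; see the linked repository for the paper's actual definitions and implementation.

```python
import numpy as np


def fit_piecewise_loglog(steps, losses):
    """Estimate where loss deceleration occurs.

    Fits a two-segment linear model to log(loss) vs. log(step) via a
    brute-force search over breakpoints (illustrative, not efficient) and
    returns the breakpoint step plus the slopes before and after it.
    Assumes `steps` and `losses` are 1-D arrays from a single training run.
    """
    x, y = np.log(steps), np.log(losses)
    best_sse, best_k = np.inf, None
    for k in range(2, len(x) - 2):
        left = np.polyfit(x[:k], y[:k], 1)
        right = np.polyfit(x[k:], y[k:], 1)
        sse = (np.sum((np.polyval(left, x[:k]) - y[:k]) ** 2)
               + np.sum((np.polyval(right, x[k:]) - y[k:]) ** 2))
        if sse < best_sse:
            best_sse, best_k = sse, k
    k = best_k
    slope_before = np.polyfit(x[:k], y[:k], 1)[0]
    slope_after = np.polyfit(x[k:], y[k:], 1)[0]
    return steps[k], slope_before, slope_after


def destructive_interference(losses_before, losses_after):
    """Fraction of per-example loss movement that cancels across examples.

    0.0 means all examples improve (or degrade) together; values near 1.0
    mean improvements on some examples are offset by degradation on others,
    i.e. the zero-sum learning regime described in the summary.
    """
    delta = np.asarray(losses_after) - np.asarray(losses_before)
    total_movement = np.abs(delta).sum()
    net_change = np.abs(delta.sum())
    return 1.0 - net_change / max(total_movement, 1e-12)
```

In this framing, a high interference value after the fitted breakpoint would be consistent with the paper's claim that zero-sum learning is what bottlenecks loss improvement once deceleration sets in.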