Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
Outline
This paper investigates scaling language models in data-constrained regimes, motivated by the prospect that the amount of text data available on the internet may soon limit further scaling. We run an extensive set of experiments varying the extent of data repetition and the compute budget, training on up to 900 billion tokens with models of up to 9 billion parameters. We find that, for a fixed compute budget, training on up to four epochs of repeated data yields negligible changes in loss compared to training on unique data; with more repetition, however, the value of adding compute eventually decays to zero. Building on these results, we propose and empirically validate a scaling law for compute optimality that accounts for the diminishing value of repeated tokens and excess parameters. Finally, we experiment with approaches for mitigating data scarcity, such as augmenting the training set with code data or relaxing commonly used filters. The models and datasets from all 400 training runs are publicly available at https://github.com/huggingface/datablations.
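To make the functional form concrete, the sketch below illustrates in Python (not taken from the released repository) how the paper's data-constrained scaling law converts total tokens seen into an "effective" unique-token count via a repetition-decay term, and excess parameters into an effective parameter count, before plugging both into a Chinchilla-style loss. The decay constants and loss coefficients here are placeholders for illustration, not the values fitted in the paper.

```python
# Minimal sketch of the data-constrained scaling-law form: repeated tokens and
# excess parameters contribute a decaying "effective" amount rather than their
# full value. R_D_STAR, R_N_STAR, and the loss coefficients are placeholders,
# not the constants fitted in the paper.
import math

R_D_STAR = 15.0  # placeholder: how quickly repeated data loses value
R_N_STAR = 5.0   # placeholder: how quickly excess parameters lose value

def effective_data(total_tokens: float, unique_tokens: float) -> float:
    """Effective unique-data equivalent D' when D total tokens are drawn from U_D unique tokens."""
    repeats = total_tokens / unique_tokens - 1.0  # R_D: epochs beyond the first
    return unique_tokens + unique_tokens * R_D_STAR * (1.0 - math.exp(-repeats / R_D_STAR))

def effective_params(params: float, base_params: float) -> float:
    """Effective parameter count N' relative to the base size U_N matched to the unique data."""
    excess = params / base_params - 1.0  # R_N: multiples of excess parameters
    return base_params + base_params * R_N_STAR * (1.0 - math.exp(-excess / R_N_STAR))

def predicted_loss(n_eff: float, d_eff: float,
                   E: float = 1.7, A: float = 400.0, B: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Chinchilla-style loss with effective N' and D' plugged in (illustrative coefficients)."""
    return E + A / n_eff**alpha + B / d_eff**beta

# Example: repeating 100B unique tokens for 4 epochs (400B tokens seen) still
# yields close to 4x the effective data, so the predicted loss gap versus fully
# unique data is small -- consistent with the four-epoch finding above.
d_eff = effective_data(total_tokens=400e9, unique_tokens=100e9)
n_eff = effective_params(params=9e9, base_params=9e9)  # no excess parameters
print(f"effective data: {d_eff / 1e9:.0f}B of 400B tokens seen")
print(f"illustrative predicted loss: {predicted_loss(n_eff, d_eff):.3f}")
```

Under these placeholder constants, four epochs of a 100B-token corpus behave like roughly 370B unique tokens, which is why the loss penalty for moderate repetition stays small while heavy repetition saturates.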