This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Created by
Haebom
Author
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Outline
This paper proposes REWIRE (REcycling the Web with guided REwrite), a method that recycles low-quality web text discarded by conventional filtering pipelines to address the "data wall" problem facing large-scale language models. REWIRE rewrites low-quality documents to make them useful for training, expanding the pre-training dataset by increasing the proportion of synthetic data. Experiments on DCLM benchmarks at the 1B, 3B, and 7B scales show that models trained on a mixture of high-quality raw text and rewritten text outperform models trained on filtered web data alone by 1.0, 1.3, and 2.5 percentage points, respectively, and perform better than models trained on twice the amount of web data. Analysis shows that approximately 82% of the mixed text derives from previously discarded low-quality documents, and that REWIRE outperforms other synthetic data generation methods such as Wikipedia-style paraphrasing, question-answer synthesis, and knowledge extraction. Reusing web text thus offers a simple and effective way to scale pre-training data. The high-quality synthetic data are available at https://huggingface.co/datasets/facebook/recycling_the_web .
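The recycle-then-mix idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `quality_score` stands in for a learned quality classifier and `guided_rewrite` stands in for LLM-guided rewriting; both are hypothetical placeholders.

```python
def quality_score(doc: str) -> float:
    """Toy proxy for a quality filter: longer, punctuated text scores higher.
    (Placeholder for a real learned classifier such as a fastText filter.)"""
    if not doc:
        return 0.0
    words = doc.split()
    return min(1.0, len(words) / 20) * (1.2 if doc.strip().endswith(".") else 0.8)

def guided_rewrite(doc: str) -> str:
    """Placeholder for an LLM prompted to rewrite a low-quality document."""
    return doc.strip().capitalize().rstrip(".") + "."

def recycle(corpus, threshold=0.5):
    """Keep high-quality docs as-is; rewrite the rest instead of discarding them.
    The returned list is the training mix of raw and rewritten text."""
    kept, rewritten = [], []
    for doc in corpus:
        if quality_score(doc) >= threshold:
            kept.append(doc)
        else:
            rewritten.append(guided_rewrite(doc))
    return kept + rewritten

corpus = [
    "This is a well-formed sentence about language models and their training data.",
    "cheap watches BUY NOW click here",
]
mix = recycle(corpus)  # both documents end up in the mix: one raw, one rewritten
```

The key design point is that the filter's rejects are routed to a rewriter rather than dropped, which is how the method recovers training signal from documents that filtering alone would waste.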
•
Takeaways:
◦
We propose a potential solution to the problem of securing the data required for large-scale language model training by reusing low-quality web data discarded by existing filtering processes.
◦
We experimentally demonstrate that the REWIRE technique can effectively generate synthetic data to expand the size of the pre-training dataset and improve model performance.
◦
We demonstrate the effectiveness of REWIRE by outperforming other existing synthetic data generation methods.
◦
We make the generated high-quality synthetic datasets publicly available to enable other researchers to utilize them.
•
Limitations:
◦
REWIRE's performance improvements were measured on a specific benchmark (DCLM) and specific model sizes, so the same gains are not guaranteed on other benchmarks or at other model scales.
◦
There is a lack of in-depth analysis of the biases and errors that can be introduced when rewriting low-quality data into high-quality data.
◦
There is a lack of analysis on the computational costs of the data reuse process. Further research is needed to determine how cost-effective the rewriting process itself is compared to the cost of pretraining.