Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Created by
  • Haebom

Authors

Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li

Outline

This paper presents REWIRE, a method for addressing the "data wall" problem: the growing difficulty of securing enough high-quality data to further improve large language models (LLMs). Instead of discarding the low-quality web documents rejected by standard data-filtering pipelines, REWIRE recycles them, using guided rewriting to transform them into higher-quality synthetic text. In DCLM benchmark experiments at the 1B, 3B, and 7B scales, mixing the rewritten texts with filtered web data improves performance by 1.0, 1.3, and 2.5 percentage points respectively over training on filtered web data alone, and is more effective than having access to twice as much raw web data. About 82% of the synthetic data comes from transforming low-quality documents that would otherwise have been discarded, and the approach outperforms other synthetic-data generation methods such as Wikipedia-style rewriting, question-answer synthesis, and knowledge extraction. This suggests that recycling web text is a simple and effective way to scale up LLM pre-training data.
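To make the idea concrete, here is a minimal Python sketch of the recycling loop described above. It is an illustration under stated assumptions, not the authors' implementation: `quality_score` stands in for a learned quality filter (e.g., a DCLM-style classifier), `llm_rewrite` is a hypothetical hook for the rewriting model, and the prompt and threshold are illustrative placeholders.

```python
# Minimal sketch of "recycling" discarded web text via guided rewriting.
# Assumptions: quality_score and llm_rewrite are hypothetical stand-ins,
# not REWIRE's actual components.

REWRITE_PROMPT = (
    "Rewrite the following web text into a clear, coherent, and informative "
    "document, preserving its factual content:\n\n{doc}"
)

def quality_score(doc: str) -> float:
    """Stand-in heuristic for a learned quality filter: fraction of
    alphabetic characters. A real pipeline would use a trained classifier."""
    return sum(ch.isalpha() for ch in doc) / len(doc) if doc else 0.0

def llm_rewrite(prompt: str) -> str:
    """Placeholder for a call to the rewriting LLM (any completion API)."""
    raise NotImplementedError("wire this to a model of your choice")

def build_pretraining_pool(raw_docs: list[str],
                           keep_threshold: float = 0.8) -> list[str]:
    """Keep documents that pass the quality filter as-is; instead of
    discarding the rest, rewrite them and pool both sources."""
    kept, recycled = [], []
    for doc in raw_docs:
        if quality_score(doc) >= keep_threshold:
            kept.append(doc)  # passes the filter unchanged
        else:
            # The key step: salvage the document instead of dropping it.
            recycled.append(llm_rewrite(REWRITE_PROMPT.format(doc=doc)))
    # Training data is the mix of filtered web text and rewritten text.
    return kept + recycled
```

The point of the sketch is the branch structure: documents below the filter threshold are rewritten rather than dropped, so the training pool grows without collecting any new web data.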

Takeaways, Limitations

Takeaways:
  • Demonstrates a new way to augment LLM pre-training data by reusing low-quality web data.
  • Points to cost-effective model training by salvaging data lost in existing filtering pipelines.
  • Outperforms existing synthetic-data generation methods.
  • Offers a practical step toward solving the “data wall” problem for LLMs.
Limitations:
  • Further research is needed on the generalizability of REWIRE; its performance should be verified across languages and domains.
  • Further analysis is needed of the quality and diversity of the generated synthetic data.
  • Efficiency and scalability need to be reviewed when the method is applied to very large datasets.
  • The paper may lack detailed descriptions of the guided-rewriting algorithm and its parameters.