Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Synthetic bootstrapped pretraining

Created by
  • Haebom

Author

Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candes, Chong Wang, Ruoming Pang

Outline

Synthetic Bootstrapped Pretraining (SBP) is a new method for pretraining language models. Whereas standard pretraining focuses on learning causal relationships between tokens within a single document, SBP models relationships between documents, uses them to generate a new large-scale synthetic corpus, and adds that corpus to pretraining. A 3-billion-parameter model pretrained with SBP on 1 trillion tokens outperforms a simple repetition baseline and recovers a significant fraction of the improvement attainable by an oracle with access to 20x more unique data. Qualitative analysis shows that the synthesized documents are not mere paraphrases: they extract core concepts from a seed document and build new narratives around them. From a Bayesian perspective, the synthesizer can be interpreted as learning to abstract the latent concepts shared by related documents.
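The sketch below illustrates the pipeline described above in a minimal, runnable form. It is not the authors' implementation: the TF-IDF pairing, the 0.3 threshold, and all helper names are assumptions made here for illustration, and the synthesizer-training and data-mixing stages are only indicated in comments.

```python
# Minimal sketch of SBP's first stage (mining related document pairs).
# TF-IDF similarity and the 0.3 threshold are illustrative assumptions,
# not the paper's implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_related_pairs(corpus, threshold=0.3):
    """Pair each document with its most similar distinct neighbor."""
    vectors = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(vectors)
    np.fill_diagonal(sims, -1.0)  # exclude self-matches
    pairs = []
    for i, j in enumerate(sims.argmax(axis=1)):
        if sims[i, j] >= threshold:
            pairs.append((corpus[i], corpus[j]))  # (seed d1, related d2)
    return pairs

corpus = [
    "Transformers use attention to model token dependencies.",
    "Attention mechanisms let transformers weigh token interactions.",
    "Gradient descent minimizes a loss over training data.",
    "Stochastic gradient descent updates parameters from mini-batches.",
]
pairs = mine_related_pairs(corpus)
# Stage 2 (not shown): fine-tune a synthesizer LM on (d1 -> d2) pairs so it
# models the inter-document conditional p(d2 | d1).
# Stage 3 (not shown): sample synthetic documents from the synthesizer and mix
# them with the real corpus for standard causal-LM pretraining.
print(pairs)
```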

Takeaways, Limitations

Takeaways:
Presents a new pretraining method that leverages relationships between documents.
Demonstrates more efficient use of existing data and improved performance over prior approaches.
Shows that higher-quality synthetic data can improve language model performance.
Offers a Bayesian interpretation of the synthesis process (see the sketch after this list).
Limitations:
The generalization of the proposed method beyond the reported setting needs further validation.
The transparency and controllability of the synthetic data generation process need further study.
The approach requires large datasets and substantial computing resources.
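One way to write down the Bayesian interpretation mentioned above, with notation assumed here rather than taken from the paper, is that the synthesizer approximates a conditional obtained by marginalizing over a latent concept shared by related documents:

```latex
% Hedged sketch: d_1 is a seed document, d_2 a related document, and c a
% latent concept they share; the synthesizer learns p(d_2 \mid d_1).
p(d_2 \mid d_1) = \int p(d_2 \mid c)\, p(c \mid d_1)\, dc
```

Under this reading, generating useful synthetic documents requires inferring the shared concept c from the seed document rather than copying its surface form, which matches the observation that the synthesized documents are not simple paraphrases.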