Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?

Created by
  • Haebom

Author

Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer

Outline

This paper studies model collapse, the phenomenon that arises when a large language model (LLM) is iteratively trained on synthetic data generated by the LLM itself. Specifically, we empirically analyze how properties of the human training data modulate the resulting distribution shift. Using a variety of human datasets, we run iterative training and, through controlled manipulation of dataset properties and regression analysis, identify the data properties that predict the magnitude of the shift. We find that lexical diversity amplifies distribution shift, while semantic diversity and data quality mitigate it. We further show that these effects are modular: data collected from a specific Internet domain has little influence on generation in other domains. Finally, experiments on political bias show that human data properties determine whether an initial bias is amplified or reduced. Overall, we present a new perspective on how different parts of the Internet can experience different types of distribution shift. A small illustrative sketch of this setup follows below.
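To make the setup concrete, here is a minimal sketch in Python of the two pieces described above: a toy recursive training loop (a bigram table standing in for the LLM) whose output is measured for lexical and semantic diversity across generations, and a regression relating dataset properties to shift magnitude. The metric choices (type-token ratio, mean pairwise TF-IDF cosine distance), the toy model, and all numbers are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): a toy recursive training loop plus
# the kind of regression used to relate data properties to distribution shift.
import random
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.linear_model import LinearRegression


def lexical_diversity(texts):
    """Type-token ratio over the corpus (a simple proxy for lexical diversity)."""
    tokens = " ".join(texts).split()
    return len(set(tokens)) / max(len(tokens), 1)


def semantic_diversity(texts):
    """Mean pairwise cosine distance between TF-IDF vectors (a crude proxy)."""
    X = TfidfVectorizer().fit_transform(texts)
    D = cosine_distances(X)
    n = D.shape[0]
    return D[np.triu_indices(n, k=1)].mean() if n > 1 else 0.0


def train_bigram(texts):
    """Stand-in 'model': a bigram continuation table fitted on the corpus."""
    table = defaultdict(list)
    for t in texts:
        words = t.split()
        for a, b in zip(words, words[1:]):
            table[a].append(b)
    return table


def generate(table, n_docs=50, length=20, seed=0):
    """Sample synthetic documents from the bigram table."""
    rng = random.Random(seed)
    starts = list(table)
    docs = []
    for _ in range(n_docs):
        out = [rng.choice(starts)]
        for _ in range(length - 1):
            nxt = table.get(out[-1])
            if not nxt:
                break
            out.append(rng.choice(nxt))
        docs.append(" ".join(out))
    return docs


# Recursive loop: each generation is trained on the previous generation's output.
human_corpus = [
    "the cat sat on the mat",
    "a dog chased the cat across the yard",
    "language models learn statistics of their training data",
    "synthetic data can drift away from human data over generations",
]
corpus = human_corpus
for gen in range(3):
    model = train_bigram(corpus)
    corpus = generate(model, seed=gen)
    print(f"gen {gen}: lexical={lexical_diversity(corpus):.3f}, "
          f"semantic={semantic_diversity(corpus):.3f}")

# Regression sketch: predict shift magnitude from dataset properties.
# The numbers below are made up; the paper fits this across many real datasets.
props = np.array([[0.42, 0.61, 0.8],   # [lexical div., semantic div., quality]
                  [0.55, 0.40, 0.6],
                  [0.30, 0.72, 0.9]])
shift = np.array([0.21, 0.35, 0.12])   # measured distribution-shift magnitude
reg = LinearRegression().fit(props, shift)
print("coefficients:", reg.coef_)
```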

Takeaways, Limitations

Takeaways:
Identifies the data properties (lexical diversity, semantic diversity, and data quality) that predict the magnitude of distribution shift during iterative LLM training on self-generated data.
Shows that these effects are modular: the characteristics of data from one Internet domain have little influence on generation in other domains.
Analyzes how human data properties determine whether an LLM's initial political bias is amplified or reduced.
Demonstrates that different parts of the Internet can undergo different types of distribution shift.
Limitations:
The types and scope of datasets and data properties covered by the analysis are limited.
Further work is needed on the generalizability of the quantitative distribution-shift measures and the prediction models.
Generalization to other LLM architectures and training methodologies remains to be verified.
The scope and limits of modularity (i.e., that a domain's influence does not extend to other domains) require further study.