Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and institutions; please credit the source when sharing.

Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

Created by
  • Haebom

Author

Brando Miranda, Alycia Lee, Sudharsan Sundar, Allison Casasola, Rylan Schaeffer, Elyas Obbad, Sanmi Koyejo

Outline

This paper presents a quantitative measure of data quality, specifically data diversity, for large language model (LLM) pre-training. Prior LLM pre-training work has focused mainly on scaling up models and datasets, while data quality has remained loosely defined. The authors propose a metric called the 'diversity coefficient' and use it to measure the diversity of natural language data and of publicly available pre-training datasets. Through experiments on 44 models of various sizes (51M to 7B parameters) based on GPT-2 and LLaMAv2, they show that the diversity coefficient correlates with downstream model evaluation performance. They conclude that the diversity coefficient captures an important aspect of data quality and provides evidence of a causal link between data diversity and improved model performance.
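The summary above does not spell out how the diversity coefficient is computed; in the paper it is defined as the expected (cosine) distance between Task2Vec embeddings of batches sampled from the dataset. The sketch below illustrates that idea with the embedding step left abstract: the random array standing in for real batch embeddings, and the function name, are illustrative assumptions, not the authors' code.

```python
import numpy as np

def diversity_coefficient(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between batch embeddings.

    `embeddings` is an (n_batches, dim) array. In the paper, each row
    would be a Task2Vec-style embedding of a sampled batch of text;
    here any fixed vector representation serves for illustration.
    """
    # Normalize rows so dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T
    n = len(embeddings)
    # Average cosine distance over distinct pairs (diagonal excluded).
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(np.mean(1.0 - off_diag))

# Toy usage: random vectors stand in for real batch embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 64))
print(diversity_coefficient(emb))
```

A higher value means sampled batches look more dissimilar to one another, i.e. the dataset is more diverse; identical batches give a coefficient of zero.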

Takeaways, Limitations

Takeaways:
  • Presents a new metric (the diversity coefficient) for quantitatively measuring the diversity of LLM pre-training data.
  • Experimentally demonstrates that the diversity coefficient is closely related to LLM downstream task performance.
  • Suggests new directions for improving data quality.
  • Shows consistent results across models of different sizes.
Limitations:
  • The diversity coefficient may not cover all aspects of data quality; factors other than diversity also need consideration.
  • Results are for specific datasets and models, so further research is needed to establish generalizability.
  • The diversity coefficient can be expensive to compute.
  • Further research is needed on how to construct datasets that optimize the diversity coefficient.