
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

Created by
  • Haebom

Author

Aryan Sajith, Krishna Chaitanya Rao Kathala

Outline

This study experimentally analyzes the relative impact of training data quality and quantity on the performance of small language models (SLMs) using the TinyStories dataset. Experiments varied the dataset size (25% and 50% of the original) and the redundancy rate (25%, 50%, 75%, and 100%). Evaluation via validation loss, accuracy, and perplexity shows that training data quality plays the more important role in overall SLM performance, particularly at the scale of these experiments. Minimal redundancy slightly improved accuracy (a 0.87% increase at 25% redundancy), while excessive redundancy degraded performance (a 40% accuracy drop at 100% redundancy). Beyond raw model performance, the study offers takeaways relevant to the economic and environmental costs of large-scale model training, and thus to the democratization of AI technology.
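The paper's summary does not specify exactly how the redundant training sets were constructed, but the described manipulation (adding duplicated examples at a fixed rate) can be sketched as follows. The function name `make_redundant` and the stand-in corpus are assumptions for illustration, not the authors' code.

```python
import random

def make_redundant(dataset, redundancy_rate, seed=0):
    """Return a copy of `dataset` augmented with duplicated examples.

    redundancy_rate: fraction of the original size to add back as
    duplicates, e.g. 0.25 appends 25% of the examples again
    (sampled with replacement).
    """
    rng = random.Random(seed)
    n_dup = int(len(dataset) * redundancy_rate)
    duplicates = [rng.choice(dataset) for _ in range(n_dup)]
    combined = dataset + duplicates
    rng.shuffle(combined)  # avoid clustering duplicates at the end
    return combined

# Stand-in for a TinyStories subset; the real study would load text examples.
stories = [f"story_{i}" for i in range(1000)]
for rate in (0.25, 0.50, 0.75, 1.00):
    corpus = make_redundant(stories, rate)
    print(f"redundancy {rate:.0%}: {len(corpus)} examples")
```

Under this construction, 100% redundancy doubles the corpus without adding any new information, which is consistent with the reported accuracy drop at that setting.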

Takeaways, Limitations

Takeaways:
We empirically demonstrate that data quality is more important than quantity in improving the performance of small-scale language models.
An appropriate level of data redundancy can contribute to improved model performance, but excessive redundancy can actually cause performance degradation.
A data quality-centric approach can address the cost and environmental challenges of large-scale model training and increase accessibility to AI technology.
Limitations:
Since we only conducted our experiments using the TinyStories dataset, the generalizability to other datasets may be limited.
The types and architectures of the small-scale language models used in the analysis are not described in detail.
A clearer explanation is needed of how data quality and redundancy are defined and measured.