Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Created by
  • Haebom

Author

Yuqing Wang, Shangding Gu

Outline

Data selection plays a critical role in data-driven decision-making, including for large language models (LLMs), and is typically task-dependent. Data quality and diversity have been studied extensively and are known to improve model performance. This paper shows that selecting more uniformly distributed data can improve performance while also enhancing training efficiency. Specifically, the authors show that a more uniform (and therefore less biased) distribution leads to a larger minimum pairwise distance between data points, denoted $h_{\min}$, and that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). They further prove that the approximation error of a neural network decreases as $h_{\min}$ increases. The paper also introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime that does not require Lipschitz smoothness and applies to a wide range of architectures, including transformers; this framework provides theoretical justification for residual connections and function composition in deep neural architectures. Comprehensive supervised fine-tuning experiments across a variety of settings (including different optimization strategies, model sizes, and training datasets) consistently show that selecting data by maximizing pairwise distances significantly accelerates LLM training and achieves comparable or better performance across diverse datasets.
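As a rough illustration of the data-selection idea described above (not the authors' exact procedure), the sketch below uses greedy farthest-point sampling over precomputed embeddings to pick a subset whose minimum pairwise distance $h_{\min}$ stays large. The embedding source, the subset size k, and the Euclidean metric are all assumptions made for illustration.

```python
# Hypothetical sketch: greedy farthest-point selection that approximately
# maximizes the minimum pairwise distance h_min among the chosen examples,
# given precomputed embeddings (one row per training example).
import numpy as np

def select_uniform_subset(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return indices of k points chosen greedily to keep pairwise distances large."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # Distance from every point to its nearest already-selected point.
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))     # farthest point from the current subset
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return np.array(selected)

if __name__ == "__main__":
    X = np.random.default_rng(42).normal(size=(1000, 16))  # toy embeddings
    idx = select_uniform_subset(X, k=100)
    print(idx[:10])
```

Greedy farthest-point sampling is a standard heuristic for this kind of max-min objective; whether the paper uses it or another selection rule is not stated in this summary.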

Takeaways, Limitations

Takeaways:
  • Selecting uniformly distributed data can improve LLM training efficiency and performance.
  • Data uniformity is quantified via the minimum pairwise distance ($h_{\min}$) and related to training speed and approximation error.
  • A GD convergence framework is developed for general neural architectures (including transformers) beyond the NTK regime.
  • The framework provides a theoretical basis for deep architecture design choices such as residual connections and function composition.
  • The effectiveness of the methodology is demonstrated through supervised fine-tuning experiments in various settings.
Limitations:
  • The specific data selection methodology is not described in detail.
  • The computational cost of actually computing and applying $h_{\min}$ is not discussed (see the sketch after this list).
  • Further research is needed to determine how well the proposed methodology generalizes to other types of deep learning models and tasks.
  • The results may be limited to the specific datasets used.
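To make the complexity concern above concrete, here is a minimal sketch (assuming Euclidean distances over precomputed embeddings, both assumptions for illustration) of how $h_{\min}$ itself can be computed; the quadratic number of pairs is the cost the limitation refers to.

```python
# Illustrative only: h_min is the minimum distance over all pairs of points,
# which naively requires examining O(n^2) pairs of selected examples.
import numpy as np
from scipy.spatial.distance import pdist

def min_pairwise_distance(embeddings: np.ndarray) -> float:
    """h_min = min over all pairs i != j of ||x_i - x_j||."""
    return float(pdist(embeddings).min())

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(500, 16))
    print(f"h_min = {min_pairwise_distance(X):.4f}")
```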