Data selection plays a critical role in data-driven decision-making, including the training of large language models (LLMs), and is typically task-dependent. Data quality and diversity have been studied extensively and are known to improve model performance. This paper shows that selecting more uniformly distributed data can further improve performance while enhancing training efficiency. Specifically, we establish that a more uniform (and therefore less biased) distribution yields a larger minimum pairwise distance between data points, denoted $h_{\min}$, and prove that a smaller $h_{\min}$ can slow the training dynamics of gradient descent (GD). Furthermore, we show theoretically that the approximation error of a neural network decreases as $h_{\min}$ increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime that does not require Lipschitz smoothness and applies to a wide range of architectures, including transformers. This framework also provides a theoretical basis for the use of residual connections and function composition in deep neural architectures. We conducted comprehensive supervised fine-tuning experiments across a variety of settings, including different optimization strategies, model sizes, and training datasets. The results consistently show that selecting data by maximizing the minimum pairwise distance significantly accelerates LLM training and achieves comparable or better performance across diverse datasets.
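
To make the selection criterion concrete, the sketch below illustrates one standard way to approximately maximize the minimum pairwise distance $h_{\min}$ over a chosen subset: greedy farthest-point selection in an embedding space. This is an illustrative example under assumed inputs (`embeddings`, `k`), not necessarily the exact procedure used in the paper.

```python
import numpy as np

def select_max_min_distance(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedily pick k rows of `embeddings` so that each newly chosen point is as far
    as possible from all previously chosen points. This greedy rule is a classical
    2-approximation to the max-min dispersion objective, i.e., maximizing h_min
    over the selected subset."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # Distance from every candidate to its nearest already-selected point.
    dist_to_selected = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist_to_selected))  # farthest remaining candidate
        selected.append(nxt)
        new_dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dist_to_selected = np.minimum(dist_to_selected, new_dist)
    return np.asarray(selected)

# Example usage on random features (hypothetical data):
# idx = select_max_min_distance(np.random.randn(10_000, 768), k=1_000)
```

Each greedy step costs $O(n d)$ for $n$ candidates in $d$ dimensions, so selecting $k$ points costs $O(n k d)$, which keeps the procedure practical for large candidate pools.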