This paper introduces GradES, a novel gradient-based early-stopping method designed to speed up the training of large-scale Transformer models. Instead of monitoring the validation loss of the entire model, as conventional early stopping does at considerable computational cost, GradES tracks the gradient change of each weight matrix within the Transformer (attention projections and feed-forward layers). When the gradient change of a matrix falls below a convergence threshold, GradES stops updating that matrix, eliminating unnecessary validation passes and mitigating overfitting. As a result, GradES improves average accuracy by 1.2% on language tasks and 3.88% on multi-modal benchmarks while reducing training time by factors of 1.57× to 7.22×.
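
The core mechanism, per-matrix freezing once gradients become small, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the threshold name `tau`, the use of the mean absolute gradient as the convergence signal, and freezing via `requires_grad_(False)` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def freeze_converged_matrices(model: nn.Module, tau: float = 1e-4) -> None:
    """Call after loss.backward(): freeze weight matrices whose gradient
    magnitude has dropped below the convergence threshold `tau` (assumed metric)."""
    for name, param in model.named_parameters():
        # Consider only 2-D weight matrices (e.g., attention projections,
        # feed-forward layers) that are still trainable and got a gradient.
        if param.requires_grad and param.grad is not None and param.dim() == 2:
            grad_magnitude = param.grad.abs().mean().item()
            if grad_magnitude < tau:
                param.requires_grad_(False)  # stop updating this matrix
                param.grad = None            # release its gradient buffer

# Sketch of use inside a training loop:
#   loss.backward()
#   freeze_converged_matrices(model, tau=1e-4)
#   optimizer.step()
#   optimizer.zero_grad()
```

Because frozen matrices no longer require gradients, subsequent backward passes skip their gradient computation, which is where the training-time savings would come from in this sketch.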