This paper aims to improve understanding of the pre-training pipeline for large language models (LLMs), specifically distributed training, managing large datasets across hundreds of nodes, and scaling data parallelism to fully utilize available GPU compute capacity. While cutting-edge AI research firms are investing billions of dollars in supercomputing infrastructure to train increasingly large models on massive datasets, public literature on performance scaling and training considerations for these large-scale training pipelines remains scarce. This paper therefore provides practical recommendations for tuning training performance when scaling up LLM training.