Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Robust LLM Training Infrastructure at ByteDance

Created by
  • Haebom

Author

Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

Outline

This paper presents ByteRobust, a management system for ByteDance's large-scale GPU training infrastructure for Large Language Models (LLMs). ByteRobust exploits the characteristics of LLM training to detect and recover from runtime faults efficiently, providing high fault tolerance together with prompt fault identification and localization. Deployed on a production platform with over 200,000 GPUs, it achieved an effective training time ratio (ETTR) of 97% on a three-month training job running on 9,600 GPUs.
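For context, ETTR measures the fraction of a job's wall-clock time spent making actual training progress. The sketch below illustrates this definition with hypothetical numbers; the hours shown are not figures from the paper.

```python
def ettr(effective_hours: float, total_hours: float) -> float:
    """Effective training time ratio: productive training time / total job time."""
    return effective_hours / total_hours

# Illustrative numbers only: a three-month run is roughly 90 * 24 = 2,160
# wall-clock hours; at 97% ETTR, only ~65 hours are lost to faults,
# diagnosis, and restarts.
total = 90 * 24
effective = 0.97 * total
print(f"ETTR = {ettr(effective, total):.0%}")   # ETTR = 97%
print(f"Hours lost = {total - effective:.0f}")  # Hours lost = 65
```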

Takeaways, Limitations

  • Emphasizes the importance of a large-scale GPU infrastructure management system for stable LLM training.
  • Proposes a fault detection and recovery strategy that exploits the characteristics of LLM training (a generic recovery sketch follows this list).
  • Improves training efficiency through high fault tolerance and fast fault identification and localization.
  • Demonstrates the system's effectiveness through real-world deployment on a platform with over 200,000 GPUs.
  • Concretely shows the efficiency gain by achieving an ETTR of 97%.
  • Limitations: the summary may lack specific technical details and information on applicability to diverse LLM models.
  • The limited information makes it difficult to assess generalizability to other training settings.
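ByteRobust's actual recovery mechanisms are not detailed in this summary. As a rough illustration of the checkpoint-and-resume pattern that fault-tolerant training systems generally build on, here is a minimal, hypothetical sketch; the function names, retry policy, and checkpoint interval are all assumptions, not ByteRobust's API.

```python
import time

CHECKPOINT_EVERY = 100  # steps between checkpoints (illustrative value)

def train_with_recovery(train_step, save_checkpoint, load_checkpoint,
                        total_steps, max_retries=3):
    """Run training, rolling back to the last checkpoint on failure."""
    step = load_checkpoint()   # resume point; 0 if starting fresh
    retries = 0
    while step < total_steps:
        try:
            train_step(step)
            step += 1
            if step % CHECKPOINT_EVERY == 0:
                save_checkpoint(step)
            retries = 0        # progress made; reset the retry budget
        except RuntimeError:   # stand-in for a GPU/communication fault
            retries += 1
            if retries > max_retries:
                raise          # persistent fault: escalate for diagnosis
            time.sleep(2 ** retries)   # back off before restarting
            step = load_checkpoint()   # roll back to the last good state

if __name__ == "__main__":
    # Toy usage with an in-memory "checkpoint"; real systems persist
    # model and optimizer state to durable storage.
    state = {"step": 0}
    train_with_recovery(
        train_step=lambda s: None,                      # no-op stand-in
        save_checkpoint=lambda s: state.update(step=s),
        load_checkpoint=lambda: state["step"],
        total_steps=500,
    )
    print("finished at step", state["step"])
```

The key design point this pattern captures is that recovery cost is bounded by the checkpoint interval: the shorter the interval, the less work is lost per fault, at the price of more checkpointing overhead.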