Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Created by
  • Haebom

Authors

Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang

Outline

This paper proposes TrainVerify, a system for detecting errors that may occur during the distributed training of large language models (LLMs). Distributed LLM training across thousands of devices is extremely expensive, and an unverified training run can waste millions of GPU hours. TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to the model's logical specification. Because direct verification is intractable at LLM scale, TrainVerify introduces a shape-reduction technique and a stage-wise parallel verification algorithm that dramatically reduce complexity while preserving formal correctness. It has been successfully applied to verify the training plans of state-of-the-art LLMs, including Llama3 (405B) and DeepSeek-V3 (671B).
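The core idea — checking that a parallelized execution plan computes the same function as the logical model — can be illustrated with a toy tensor-parallel matrix multiply. This is only an illustrative sketch, not TrainVerify's actual implementation: the paper's system performs formal verification, whereas this sketch checks numerically, and it uses deliberately tiny dimensions in the spirit of shape reduction. All function names here are hypothetical.

```python
import numpy as np

def logical_matmul(x, w):
    # Logical specification: a single dense matmul on one device.
    return x @ w

def tensor_parallel_matmul(x, w, num_shards):
    # Parallel execution plan: shard w column-wise across devices,
    # compute partial results independently, then concatenate
    # (the analogue of an all-gather).
    shards = np.split(w, num_shards, axis=1)
    partials = [x @ s for s in shards]
    return np.concatenate(partials, axis=1)

# "Shape reduction" in spirit: check equivalence on tiny dimensions
# rather than full model size. Integer inputs make equality exact.
rng = np.random.default_rng(0)
x = rng.integers(-5, 5, size=(4, 6))
w = rng.integers(-5, 5, size=(6, 8))

assert np.array_equal(logical_matmul(x, w),
                      tensor_parallel_matmul(x, w, num_shards=2))
print("parallel plan matches logical spec")
```

A buggy plan (e.g., concatenating shards in the wrong order, or splitting along the wrong axis) would fail this equivalence check — the kind of silent error the paper argues can waste large amounts of compute.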

Takeaways, Limitations

Takeaways:
Increases the reliability of large language model training.
Early detection of errors in distributed training prevents wasted compute costs.
Provides a practical method for verifying the training plans of cutting-edge LLMs.
Limitations:
TrainVerify's performance depends on the efficiency of the shape-reduction technique and the stage-wise parallel verification algorithm; it may degrade for more complex models or distributed-training setups.
The verification results are only as trustworthy as the model's logical specification: if the specification itself contains errors, TrainVerify cannot detect them.
The range of training plans verified so far is limited, and applicability to even larger models remains to be confirmed.