Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Created by
  • Haebom

Authors

Ziji Shi, Le Jiang, Ang Wang, Jie Zhang, Chencan Wu, Yong Li, Xiaokui Xiao, Wei Lin, Jialin Li

Outline

This paper presents TAPAS, an automatic parallelization framework that tackles the problem of automatically deriving tensor parallel strategies, which are essential for distributed training of large-scale neural networks. Whereas the search space of existing methods grows exponentially with model size, TAPAS prunes it with a divide-and-conquer approach that exploits the recurrent substructure of neural networks, achieving sub-linear search complexity in model size and thus scaling to very large networks. Experiments show that TAPAS searches up to 160x faster than existing state-of-the-art automatic parallelization frameworks, while the derived strategies match or exceed the performance of the expert-designed Megatron-LM library.
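
The paper does not publish its algorithm here, but the core divide-and-conquer idea can be sketched as follows: search the strategy space only for each unique substructure (e.g., one transformer layer) and reuse the result for every repeated occurrence, so search cost tracks the number of distinct blocks rather than the total model size. The candidate strategies, cost function, and block signatures below are illustrative assumptions, not TAPAS's actual implementation.

```python
from itertools import product

# Hypothetical per-operator sharding choices; the real search space is far
# richer (per-tensor layouts, communication costs, device topology, etc.).
CANDIDATE_STRATEGIES = ["row_parallel", "column_parallel", "replicated"]

def search_block(block_ops, cost_fn):
    """Exhaustively search one block's operators (exponential in block size)."""
    best_plan, best_cost = None, float("inf")
    for assignment in product(CANDIDATE_STRATEGIES, repeat=len(block_ops)):
        cost = cost_fn(block_ops, assignment)
        if cost < best_cost:
            best_plan, best_cost = dict(zip(block_ops, assignment)), cost
    return best_plan

def search_model(blocks, cost_fn):
    """Divide and conquer: solve each *unique* block once, then reuse the plan
    for every repeated occurrence (e.g., identical transformer layers)."""
    cache = {}
    full_plan = []
    for block in blocks:
        key = tuple(block)          # structural signature of the block
        if key not in cache:        # search only previously unseen substructures
            cache[key] = search_block(block, cost_fn)
        full_plan.append(cache[key])
    return full_plan

# Toy cost model: prefer "column_parallel" for matmuls, "replicated" otherwise.
def toy_cost(ops, assignment):
    return sum(0 if (op.startswith("matmul") and s == "column_parallel")
               or (not op.startswith("matmul") and s == "replicated") else 1
               for op, s in zip(ops, assignment))

layer = ["matmul_qkv", "matmul_proj", "layernorm"]
model = [layer] * 48                # 48 identical layers -> a single block search
print(search_model(model, toy_cost)[0])
```

With 48 identical layers, the exhaustive search runs once instead of 48 times; this is the sense in which reusing recurrent substructure keeps the overall search cost sub-linear in model size.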

Takeaways, Limitations

Takeaways:
Presents TAPAS, an efficient framework for automatic tensor parallelization of large-scale neural networks.
Reduces search complexity from exponential to sub-linear in model size.
Achieves search speeds up to 160x faster than existing state-of-the-art automatic parallelization frameworks.
Automatically reaches expert-level performance, comparable to or better than Megatron-LM.
Limitations:
TAPAS's performance gains may depend on the specific neural network architecture; generalization across diverse architectures still needs to be evaluated.
The experiments cover specific models and hardware environments, so the results need to be verified in other settings.
The optimality of the automatically derived tensor parallel strategies may still vary with the model and hardware configuration.