Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Model Parallelism With Subnetwork Data Parallelism

Created by
  • Haebom

Author

Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

Outline

To address the memory and communication costs of large-scale neural network pretraining, we propose Subnetwork Data Parallelism (SDP), a distributed training framework that splits the model into subnetworks across workers and trains them without exchanging activations. SDP explores two masking methods: backward masking, which applies sparsity only in the backward pass to preserve unbiased gradients, and forward masking, which also removes parameters from the forward pass to improve efficiency and provide additional regularization. Furthermore, we explore two subnetwork construction strategies, one at the neuron level and one at the block level, applicable to both CNNs and Transformers. Experiments with CNNs and Transformers on CIFAR and ImageNet, as well as LLM pretraining on FineWeb, demonstrate that SDP maintains or improves performance while reducing per-device memory usage by 30%-75%. In particular, forward masking achieves superior performance in FLOP-matched settings.
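The following is a minimal PyTorch sketch, not the authors' implementation, that contrasts the two masking modes under simplified assumptions: a single linear layer stands in for one worker's subnetwork, and the neuron-level mask is drawn at random rather than by the paper's construction strategy.

```python
# Minimal sketch of backward vs. forward masking, assuming PyTorch and a
# single nn.Linear standing in for one worker's subnetwork. Layer sizes and
# the mask choice are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(8, 8)

# Neuron-level binary mask: this worker keeps roughly half the output neurons.
keep_rows = torch.rand(layer.out_features) < 0.5
mask = keep_rows.float().unsqueeze(1).expand_as(layer.weight)

x = torch.randn(4, 8)

# Backward masking: the forward pass stays dense (unbiased), but a gradient
# hook zeroes updates for parameters outside this worker's subnetwork.
hook = layer.weight.register_hook(lambda g: g * mask)
layer(x).pow(2).mean().backward()
hook.remove()
layer.zero_grad()

# Forward masking: masked parameters are removed from the forward computation
# itself, which saves FLOPs and acts as a regularizer.
w_masked = layer.weight * mask
F.linear(x, w_masked, layer.bias).pow(2).mean().backward()
```

In a full SDP setup each worker would hold a different mask and only the surviving parameters would be synchronized; the paper's exact subnetwork construction and communication scheme are not reproduced here.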

Takeaways, Limitations

Takeaways:
SDP significantly reduces memory usage when training large models.
SDP improves efficiency while maintaining, and in some cases improving, performance.
Forward masking provides additional regularization and may perform better in FLOP-matched settings.
Applicable to CNNs and Transformers, to various datasets, and to LLM pretraining.
Limitations:
No explicit limitations are stated in the paper.