Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Model Parallelism With Subnetwork Data Parallelism

Created by
  • Haebom

Author

Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

Subnetwork Data Parallelism (SDP)

Outline

Large-scale neural network pretraining imposes heavy memory requirements on accelerators and often requires expensive communication. This paper introduces Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. Two complementary masking approaches are investigated: backward masking applies sparsity only in the backward pass to preserve unbiased gradients, while forward masking also removes parameters from the forward pass, yielding stronger efficiency gains and additional regularization. Two subnetwork configuration strategies are also explored, one at the neuron level and one at the block level, applied to both CNNs and Transformers. In experiments with CNNs and Transformers on CIFAR and ImageNet, as well as LLM pretraining on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-matched settings, forward masking can sometimes achieve better performance.
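To make the distinction concrete, here is a minimal PyTorch-style sketch of the two masking modes on a single linear layer. The function names and the neuron-level mask construction are illustrative assumptions, not the authors' implementation: backward masking keeps the forward pass dense and restricts only the weight gradient to this worker's subnetwork, while forward masking drops the masked parameters from the forward computation as well.

```python
import torch
import torch.nn.functional as F

class BackwardMaskedLinear(torch.autograd.Function):
    """Dense forward pass; only the weight gradient is masked in the backward pass.
    Assumes 2D inputs for simplicity."""

    @staticmethod
    def forward(ctx, x, weight, mask):
        ctx.save_for_backward(x, weight, mask)
        return F.linear(x, weight)            # full (unmasked) forward

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, mask = ctx.saved_tensors
        grad_x = grad_out @ weight             # dense gradient w.r.t. activations
        grad_w = (grad_out.t() @ x) * mask     # update only this worker's subnetwork
        return grad_x, grad_w, None


def forward_masked_linear(x, weight, mask):
    """Parameters outside the subnetwork are dropped in the forward pass as well,
    so both compute and gradients are restricted to the masked weights."""
    return F.linear(x, weight * mask)


# Toy usage: a neuron-level mask keeping half of the output units on this worker.
weight = torch.randn(8, 16, requires_grad=True)
mask = (torch.arange(8) % 2 == 0).float().unsqueeze(1).expand(8, 16)
x = torch.randn(4, 16)

y_bwd = BackwardMaskedLinear.apply(x, weight, mask)   # unbiased forward, sparse update
y_fwd = forward_masked_linear(x, weight, mask)        # sparse forward and update
```

With backward masking the loss is still computed from the full model, which is why the kept gradients remain unbiased; with forward masking each worker effectively trains a smaller model, which is where the extra efficiency and regularization come from.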

Takeaways, Limitations

Takeaways:
SDP is an effective way to reduce per-device memory usage when training large-scale neural networks.
Backward masking and forward masking offer complementary trade-offs between efficiency and regularization.
Neuron-level and block-level subnetwork configuration strategies allow SDP to be applied to a variety of model architectures (see the sketch after this list).
In FLOP-matched settings, forward masking can even improve performance.
Limitations:
A detailed analysis of which components account for the reported performance gains and memory reductions may be lacking.
Further research is needed on how well SDP generalizes to other model architectures and datasets.
Implementation and tuning may introduce additional complexity.
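For illustration, the sketch below shows one way the two subnetwork configuration strategies could assign parameters to workers: neuron-level masks that keep a disjoint slice of a layer's output units on each worker, and block-level assignment that gives each worker a subset of whole blocks. The round-robin partitioning rule and function names are assumptions for this example, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

def neuron_level_masks(layer: nn.Linear, num_workers: int):
    """Each worker keeps a disjoint slice of the layer's output neurons."""
    masks = []
    for w in range(num_workers):
        keep = torch.arange(layer.out_features) % num_workers == w
        masks.append(keep.float().unsqueeze(1).expand_as(layer.weight))
    return masks

def block_level_assignment(model: nn.Sequential, num_workers: int):
    """Each worker trains a subset of whole blocks (e.g. Transformer layers)."""
    return {w: [i for i in range(len(model)) if i % num_workers == w]
            for w in range(num_workers)}

# Toy usage with 2 workers.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16),
                      nn.Linear(16, 16), nn.Linear(16, 16))
print(block_level_assignment(model, num_workers=2))          # {0: [0, 2], 1: [1, 3]}
print(neuron_level_masks(model[0], num_workers=2)[0].shape)  # torch.Size([16, 16])
```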