Large-scale neural network pretraining places heavy memory demands on accelerators and often incurs expensive communication. In this paper, we introduce Subnetwork Data Parallelism (SDP), a distributed learning framework that partitions models into structured subnetworks trained across workers without exchanging activations. We investigate two complementary masking approaches: backward masking applies sparsity only in the backward pass, preserving unbiased gradients, while forward masking also removes parameters from the forward pass, yielding stronger efficiency gains and additional regularization. We further explore two subnetwork configuration strategies, one at the neuron level and one at the block level, applied to CNNs and Transformers. In experiments with CNNs and Transformers on CIFAR and ImageNet, and with LLM pretraining on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-consistent settings, forward masking can sometimes achieve better performance.
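To make the distinction between the two masking modes concrete, the following is a minimal PyTorch-style sketch of neuron-level masking on a single linear layer. The class name `MaskedLinear` and the arguments `keep_prob` and `mode` are illustrative assumptions, not the paper's implementation; in SDP each worker would hold its own mask and gradients would be aggregated across workers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Illustrative linear layer with neuron-level (output-row) masking."""

    def __init__(self, in_features, out_features, keep_prob=0.5, mode="forward"):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Fixed structured mask over output neurons: shape (out_features, 1)
        # so it broadcasts over the (out_features, in_features) weight matrix.
        mask = (torch.rand(out_features) < keep_prob).float().unsqueeze(1)
        self.register_buffer("mask", mask)
        self.mode = mode
        if mode == "backward":
            # Backward masking: the forward pass stays dense; only the
            # gradients are sparsified, so this worker updates a subnetwork.
            self.linear.weight.register_hook(lambda g: g * self.mask)
            self.linear.bias.register_hook(lambda g: g * self.mask.squeeze(1))

    def forward(self, x):
        if self.mode == "forward":
            # Forward masking: masked neurons are removed from the forward
            # pass as well, saving compute and acting as a regularizer.
            return F.linear(
                x,
                self.linear.weight * self.mask,
                self.linear.bias * self.mask.squeeze(1),
            )
        # Backward masking (or no masking): dense forward computation.
        return self.linear(x)
```

In the backward-masking case, averaging the masked gradients from workers with complementary subnetworks recovers an unbiased estimate of the full gradient, which is why that mode preserves unbiased updates at the cost of a dense forward pass.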