To address the memory and communication costs of large-scale neural network pretraining, we propose Subnetwork Data Parallelism (SDP), a distributed training framework that splits the model across workers and trains it without exchanging activations. SDP explores two masking methods: backward masking, which applies sparsity only in the backward pass to preserve unbiased gradients, and forward masking, which also eliminates parameters from the forward pass to improve efficiency and provide regularization. We further explore two subnetwork configuration strategies, one at the neuron level and one at the block level, applicable to both CNNs and Transformers. Experiments with CNNs and Transformers on CIFAR and ImageNet, as well as LLM pretraining on FineWeb, demonstrate that SDP maintains or improves performance while reducing per-device memory usage by 30%-75%. In particular, forward masking achieves superior performance in FLOP-matched settings.
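The two masking modes can be illustrated with a minimal PyTorch-style sketch. This is a hypothetical example, not the paper's implementation: the class name `MaskedLinear`, the `keep_prob` and `forward_mask` arguments, and the per-weight random mask are our assumptions. The sketch only shows the per-worker masking logic; backward masking keeps the forward pass dense and zeroes gradients outside the worker's subnetwork, while forward masking drops the same parameters from the forward computation as well.

```python
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    """Illustrative linear layer with a fixed binary mask over its weights.

    forward_mask=True  -> forward masking: masked weights are absent from
                          both the forward and backward computation.
    forward_mask=False -> backward masking: the forward pass stays dense,
                          but gradients are kept only for unmasked weights.
    """

    def __init__(self, in_features, out_features, keep_prob=0.5, forward_mask=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Fixed binary mask marking the weights this worker "owns".
        mask = (torch.rand_like(self.linear.weight) < keep_prob).float()
        self.register_buffer("mask", mask)
        self.forward_mask = forward_mask
        if not forward_mask:
            # Backward masking: zero out gradients of weights outside the
            # subnetwork while leaving the dense forward pass untouched.
            self.linear.weight.register_hook(lambda g: g * self.mask)

    def forward(self, x):
        if self.forward_mask:
            # Forward masking: masked weights are removed from the forward
            # computation as well, saving FLOPs and acting as regularization.
            return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)
        return self.linear(x)
```

In SDP, each worker would hold a different mask of this kind and the resulting gradients would be aggregated across workers as in standard data parallelism; how the masks are constructed (at the neuron or block level) and how gradients are rescaled and combined follow the schemes described in the paper, not this sketch.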