Distributed pretraining of large-scale models imposes excessive memory requirements on individual nodes and incurs substantial inter-node communication costs. This paper proposes a novel alternative approach that reduces memory requirements by training small, structured subnetworks of the model on each worker. Unlike pipelining, this approach avoids inter-node communication of activations and keeps bandwidth requirements at or below those of standard all-reduce-based data-parallel communication schemes. This paper evaluates two subnetwork configuration strategies based on the principle that every parameter should be represented uniformly across workers during distributed training. The results show that the stochastic block-dropping technique consistently outperforms the breadth-wise subnetwork configurations explored in prior federated learning work. We empirically attribute this superior performance to stronger gradient alignment in blocks that retain skip connections. Preliminary experimental results highlight the potential of this approach to reduce memory usage by 20-40% without compromising performance.
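To make the block-dropping idea concrete, the following is a minimal sketch of how each worker might train a randomly sampled subset of residual blocks while dropped blocks reduce to the identity map through their skip connections. This is an illustration only, not the paper's implementation: the framework (PyTorch), the names `ResidualBlock`, `sample_worker_masks`, `keep_prob`, and the simulated-worker loop are all assumptions, and the real method may sample and aggregate subnetworks differently.

```python
# Hypothetical sketch of stochastic block-dropping for subnetwork training.
import random
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Toy pre-norm residual block; skipping it leaves the identity path intact."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))  # skip connection


class BlockDropModel(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, active: list) -> torch.Tensor:
        # Only active blocks run (and hence store activations / receive gradients);
        # dropped blocks contribute the identity via the residual path.
        for block, keep in zip(self.blocks, active):
            if keep:
                x = block(x)
        return self.head(x)


def sample_worker_masks(depth: int, num_workers: int, keep_prob: float) -> list:
    """Assign each block to a random subset of workers so that, in expectation,
    every block is trained by the same fraction of workers (uniform parameter
    representation). Any block left unassigned in a round is given to one worker."""
    masks = [[random.random() < keep_prob for _ in range(depth)] for _ in range(num_workers)]
    for layer in range(depth):
        if not any(mask[layer] for mask in masks):
            masks[random.randrange(num_workers)][layer] = True
    return masks


if __name__ == "__main__":
    depth, dim, num_workers = 8, 64, 4
    model = BlockDropModel(dim, depth)
    masks = sample_worker_masks(depth, num_workers, keep_prob=0.7)
    x = torch.randn(2, dim)
    for worker, active in enumerate(masks):  # simulated workers
        y = model(x, active)
        print(f"worker {worker}: trains {sum(active)}/{depth} blocks, output {tuple(y.shape)}")
```

Under this sketch, each worker holds optimizer state and activations only for its active blocks, which is where the memory savings would come from; synchronizing the resulting updates could reuse a standard all-reduce-style aggregation restricted to the parameters each worker actually trained.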