In this paper, we propose StagFormer (Staggered Transformer), a novel architecture for parallelizing the decoding process of Transformer-based language models. Unlike the strictly sequential decoding of conventional Transformers, StagFormer parallelizes execution along the depth of the model by staggering it along the sequence axis. This is achieved by breaking the dependence of the token representation at time step $i$ in layer $l$ on the token representations up to time step $i$ in layer $l-1$, and instead allowing it to depend only on the token representations up to time step $i-1$. This allows different sections of the model to execute in parallel, thereby improving decoding speed while maintaining quality. We also explore several extensions, including weight sharing, limited-window attention, extensions to more than two sections, and recurrent model approximations.
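
The sketch below is a minimal NumPy illustration of this staggered dependency, not the paper's implementation: the upper section attends to the lower section's states shifted back by one time step, so both sections could run in parallel at decode time. The helper names (`causal_mask`, `attend`) and the zero placeholder for the first step are assumptions made for the example.

```python
# Minimal sketch of the staggered dependency pattern (illustrative, not the
# authors' code): layer section 2 at step i sees section-1 states only up to i-1.
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions j <= i."""
    idx = np.arange(T)
    return idx[None, :] <= idx[:, None]

def attend(q, k, v, mask):
    """Toy single-head attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

T, d = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))                     # token embeddings
h_lower = attend(x, x, x, causal_mask(T))       # lower section: standard causal pass

# Shift lower-section states back by one step so position i sees only h_lower[:i];
# a zero vector stands in for the missing state at the first position.
h_prev = np.vstack([np.zeros((1, d)), h_lower[:-1]])
h_upper = attend(x, h_prev, h_prev, causal_mask(T))
print(h_upper.shape)  # (6, 8)
```

Because the upper section at step $i$ never needs the lower section's output for step $i$ itself, the two sections can be evaluated concurrently during decoding, which is the source of the speedup described above.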