In this paper, we present ATTENTION2D, a novel method that exploits parallelism along both the query and key/value dimensions to address the computational and memory overhead of the self-attention mechanism in Transformer-based models. ATTENTION2D delivers faster training and inference than existing methods without relying on approximations or incurring additional computational or memory overhead, and it scales effectively across a large number of processing units. Experimental results with a GPT-3-like model show up to 5x and 9.4x performance improvements over Ring Attention on multiple NVIDIA A100 and H100 GPUs, respectively.
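To make the two-dimensional decomposition concrete, the sketch below is a minimal single-process illustration of the general idea stated above: queries are split into blocks along one axis and keys/values into blocks along the other, as if each (query-block, key/value-block) pair were assigned to one worker of a 2D grid, and partial results along the key/value axis are merged with running log-sum-exp statistics so the final output equals exact softmax attention. The function name, block counts, and sequential loops are illustrative assumptions for exposition; they are not the paper's distributed implementation or communication schedule.

```python
# Minimal sketch (assumed names and block sizes): exact attention computed
# block-by-block over a q_blocks x kv_blocks grid, merging partial results
# along the key/value axis with online-softmax statistics.
import numpy as np

def blocked_attention_2d(Q, K, V, q_blocks=4, kv_blocks=4):
    """Exact softmax attention over a q_blocks x kv_blocks tile grid."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    Qs = np.array_split(Q, q_blocks)
    Ks = np.array_split(K, kv_blocks)
    Vs = np.array_split(V, kv_blocks)
    row = 0
    for Qi in Qs:                                   # each grid "row" owns one query block
        m = np.full(Qi.shape[0], -np.inf)           # running row-wise max
        l = np.zeros(Qi.shape[0])                   # running softmax denominator
        acc = np.zeros_like(Qi)                     # running weighted sum of V
        for Kj, Vj in zip(Ks, Vs):                  # reduction across the key/value axis
            S = Qi @ Kj.T / np.sqrt(d)              # partial score tile
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)               # rescale previously accumulated terms
            P = np.exp(S - m_new[:, None])
            l = l * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ Vj
            m = m_new
        out[row:row + Qi.shape[0]] = acc / l[:, None]
        row += Qi.shape[0]
    return out

# Sanity check against a single-block reference computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blocked_attention_2d(Q, K, V), ref, atol=1e-6)
```

In an actual 2D-parallel setting, the inner reduction would be carried out by communication among workers sharing a query block rather than by a sequential loop; the merge rule for the partial max, denominator, and accumulator is what allows that reduction to proceed in any order without approximation.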