Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, reduces memory by compressing key-value states into low-dimensional latent vectors. However, under tensor parallelism (TP), each device must load the full latent KV cache, which erodes MLA's advantage over Grouped Query Attention (GQA). In this paper, we propose Tensor-Parallel Latent Attention (TPLA), which partitions both the latent representation and each head's input dimension across devices, performs attention independently on each shard, and then combines the results with an all-reduce. TPLA improves TP efficiency while retaining the benefits of a compressed KV cache. Unlike Grouped Latent Attention (GLA), every head in TPLA still utilizes the full latent representation, preserving stronger representational capacity. TPLA is compatible with models pretrained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying a simple orthogonal transform, such as a Hadamard transform or PCA, before TP slicing mitigates cross-shard interference and limits accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve speedups of 1.79x and 1.93x, respectively, at a 32K-token context length, while maintaining performance on commonsense and LongBench benchmarks. Implemented with FlashAttention-3, TPLA delivers substantial end-to-end acceleration.
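To make the shard-and-combine step concrete, the following is a minimal single-head NumPy sketch of the decoding pattern described above: each TP rank holds one slice of the latent dimension, attends over that slice only, and the partial outputs are summed, standing in for the all-reduce across ranks. The function and variable names (`tpla_decode_step`, `q_latent`, `kv_latent_cache`, `w_out`) are illustrative assumptions, not the paper's implementation, and details such as RoPE-carrying dimensions, multi-head layout, scaling, and the orthogonal pre-transform are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tpla_decode_step(q_latent, kv_latent_cache, w_out, tp_degree=2):
    """Illustrative TPLA-style decoding for a single head (assumption-laden sketch).

    q_latent:        (d_c,)     query projected into the latent space
    kv_latent_cache: (T, d_c)   compressed KV latents for T cached tokens
    w_out:           (d_c, d_o) output projection from latent to model dim
    """
    d_c = q_latent.shape[0]
    shard = d_c // tp_degree                   # latent dims per TP rank
    out = np.zeros(w_out.shape[1])
    for rank in range(tp_degree):              # loop simulates the TP ranks
        sl = slice(rank * shard, (rank + 1) * shard)
        q_r = q_latent[sl]                     # local slice of the query
        kv_r = kv_latent_cache[:, sl]          # local slice of the latent KV cache
        scores = kv_r @ q_r / np.sqrt(shard)   # (T,) logits from this shard only
        attn = softmax(scores)                 # attention computed independently per rank
        ctx_r = attn @ kv_r                    # (shard,) local context vector
        out += ctx_r @ w_out[sl]               # partial output; the sum plays the all-reduce
    return out

# Toy usage: 4 cached tokens, latent dim 8, output dim 6, 2 TP ranks.
rng = np.random.default_rng(0)
y = tpla_decode_step(rng.normal(size=8), rng.normal(size=(4, 8)),
                     rng.normal(size=(8, 6)), tp_degree=2)
print(y.shape)  # (6,)
```

Because each rank's softmax sees only partial-dimension logits, the sharded result is not identical to full MLA attention; this is the cross-shard interference that the abstract's orthogonal transform (e.g., Hadamard or PCA applied to the latent dimension before slicing) is intended to mitigate.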