Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference

Created by
  • Haebom

Authors

Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang

Outline

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, reduces memory by compressing key-value states into low-dimensional latent vectors. Under tensor parallelism (TP), however, each device must load the full latent cache, which erodes MLA's advantage over Grouped Query Attention (GQA). In this paper, we propose Tensor-Parallel Latent Attention (TPLA): the latent representation and the input dimension of each head are split across devices, attention is performed independently on each fragment, and the results are combined with an all-reduce. TPLA improves TP efficiency while retaining the benefits of a compressed KV cache. Unlike Grouped Latent Attention (GLA), every head in TPLA still utilizes the entire latent representation, preserving stronger representational capacity. TPLA is compatible with models pretrained using MLA: it supports MLA-style prefill and enables efficient tensor-parallel decoding without retraining. Applying a simple orthogonal transform (e.g., a Hadamard transform or PCA) before TP slicing mitigates cross-shard interference and accuracy degradation. By reducing the per-device KV cache, we achieve speedups of 1.79x for DeepSeek-V3 and 1.93x for Kimi-K2 at a 32K-token context length, while maintaining performance on commonsense and LongBench benchmarks. Implemented with FlashAttention-3, TPLA enables substantial end-to-end acceleration.
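To make the sharding idea above concrete, below is a minimal single-process sketch in NumPy that simulates tensor-parallel devices as slices of the latent dimension: each "device" holds only its slice of the compressed KV cache (optionally rotated by a Hadamard transform before slicing), runs attention on that fragment independently, and the per-shard outputs are summed in place of an all-reduce. All shapes, variable names, and the exact combination step are illustrative assumptions based on the summary, not the paper's actual kernel or implementation.

# Illustrative sketch of the TPLA idea (assumed shapes and combination step),
# simulating TP "devices" as slices of the latent dimension in one process.
import numpy as np

rng = np.random.default_rng(0)
T, d_latent, d_head, tp = 8, 16, 8, 2   # seq len, latent dim, head dim, TP degree

# MLA-style compressed KV cache: one latent vector per token (hypothetical shapes).
c_kv = rng.standard_normal((T, d_latent))
W_k = rng.standard_normal((d_latent, d_head))   # latent -> key (single head for brevity)
W_v = rng.standard_normal((d_latent, d_head))   # latent -> value
q = rng.standard_normal(d_head)                 # current query (decode step)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def hadamard(n):
    # Normalized Hadamard matrix of size n (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# Optional orthogonal transform of the latent dimension before slicing,
# intended to spread information evenly across shards (Hadamard here; PCA also possible).
R = hadamard(d_latent)
c_rot, Wk_rot, Wv_rot = c_kv @ R, R.T @ W_k, R.T @ W_v   # R is orthogonal: R @ R.T = I

# --- TPLA-style sharded decode (simulated) ---
shard = d_latent // tp
partial_outputs = []
for dev in range(tp):
    sl = slice(dev * shard, (dev + 1) * shard)
    # Each "device" stores only its slice of the latent cache plus the matching
    # rows of the up-projections, and runs attention on that fragment alone.
    k_local = c_rot[:, sl] @ Wk_rot[sl]          # (T, d_head)
    v_local = c_rot[:, sl] @ Wv_rot[sl]          # (T, d_head)
    attn = softmax(k_local @ q / np.sqrt(d_head))
    partial_outputs.append(attn @ v_local)

# The sum below stands in for the all-reduce that combines shard results.
out_tpla = sum(partial_outputs)

# Unsharded MLA reference: exact attention over the full latent cache.
k_full, v_full = c_kv @ W_k, c_kv @ W_v
out_ref = softmax(k_full @ q / np.sqrt(d_head)) @ v_full

print("TPLA-style (sharded):", np.round(out_tpla[:4], 3))
print("MLA reference       :", np.round(out_ref[:4], 3))

Because each shard applies its own softmax in this reading, the combined output only approximates full MLA attention; the orthogonal rotation applied before slicing is what the paper credits with reducing this cross-shard interference and the resulting accuracy loss.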

Takeaways and Limitations

Takeaways:
Proposes TPLA, a novel attention mechanism for efficient long-context inference under tensor parallelism.
Improves TP efficiency while retaining the advantages of MLA's compressed KV cache.
Compatible with pretrained MLA models and usable without retraining.
Demonstrates practical end-to-end acceleration through integration with FlashAttention-3.
Achieves significant speedups on DeepSeek-V3 and Kimi-K2.
Limitations:
Further research is needed on the generalization performance of the proposed method.
Performance evaluation on various model architectures and hardware platforms is required.
The optimal choice of orthogonal transform and the associated parameter tuning require further study.