Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Created by
  • Haebom

Author

Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

Outline

Multi-Head Latent Attention (MLA), proposed by DeepSeek, is an architecture that compresses the Key-Value (KV) cache into latent vectors, enabling efficient and economical inference. This paper proposes MHA2MLA, the first data-efficient fine-tuning method for migrating Transformer-based LLMs from standard Multi-Head Attention (MHA) to MLA. MHA2MLA combines two components, partial-RoPE and a low-rank approximation of the key/value projections, and can recover performance with only a small amount of data by building the low-rank factors from a joint SVD of the pre-trained model's parameters. This reduces inference cost and can be combined with other compression techniques such as KV cache quantization. On Llama2-7B, the method reduces the KV cache size by 92.19% with only a 0.5% drop in LongBench performance.
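To make the low-rank approximation concrete, the sketch below shows how the pre-trained key and value projection weights of a single attention head can be jointly factorized with a truncated SVD so that only a small latent vector needs to be cached per token. This is a minimal illustration of the general idea, not the authors' code: the dimensions (d_model, d_head, and the latent rank r) are assumed placeholder values, random matrices stand in for real weights, and partial-RoPE is omitted.

```python
# Illustrative sketch (not the paper's released code): jointly factorize the
# pre-trained key/value projections of one attention head with a truncated SVD,
# so that only an r-dimensional latent vector has to be cached per token.
# d_model, d_head, and r below are assumed values for demonstration only.
import numpy as np

d_model, d_head, r = 4096, 128, 16

# Pre-trained per-head projections (random stand-ins for real weights).
W_k = np.random.randn(d_model, d_head) / np.sqrt(d_model)
W_v = np.random.randn(d_model, d_head) / np.sqrt(d_model)

# Joint SVD over the concatenated K/V projection: W_kv ~ U_r diag(S_r) Vt_r.
W_kv = np.concatenate([W_k, W_v], axis=1)           # (d_model, 2*d_head)
U, S, Vt = np.linalg.svd(W_kv, full_matrices=False)
W_down = U[:, :r] * S[:r]                           # shared down-projection (d_model, r)
W_up = Vt[:r]                                       # up-projection (r, 2*d_head)
W_up_k, W_up_v = W_up[:, :d_head], W_up[:, d_head:]

# At inference time, only the latent c = x @ W_down is cached per token;
# keys and values are reconstructed on the fly from that latent.
x = np.random.randn(1, d_model)                     # one token's hidden state
c = x @ W_down                                      # cached latent, shape (1, r)
k_approx, v_approx = c @ W_up_k, c @ W_up_v         # reconstructed K and V
print(k_approx.shape, v_approx.shape)               # (1, 128) (1, 128)
```

Because the factors are initialized from the pre-trained weights, the approximation error is governed by the discarded singular values, which is consistent with the paper's observation that performance can be recovered with little fine-tuning data.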

Takeaways, Limitations

Takeaways:
Proposes a data-efficient fine-tuning method for converting MHA-based models to MLA.
Performance can be recovered using only a small fraction of data (0.3% to 0.6%).
Reduces inference cost and can be combined with KV cache quantization.
Reduces the KV cache size by 92.19% on Llama2-7B with minimal performance degradation (a back-of-the-envelope memory sketch follows the Limitations list).
Limitations:
Further research is needed to determine the generalizability of the methodology presented in this paper and its applicability to other LLM architectures.
MHA2MLA's performance still needs to be verified across diverse datasets and deployment environments.
Further research is needed to determine and tune the optimal parameters of the proposed methodology.
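To put the 92.19% KV-cache reduction into memory terms, here is a back-of-the-envelope calculation using Llama2-7B's public architecture figures (32 layers, 32 attention heads, head dimension 128) and an fp16 cache. Only the 92.19% ratio comes from the summary above; the context length and derived numbers are illustrative.

```python
# Back-of-the-envelope KV-cache arithmetic for the 92.19% figure above.
# Llama2-7B architecture numbers (32 layers, 32 heads, head_dim 128) are public;
# everything derived here is illustrative, not a result reported by the paper.
n_layers, n_heads, d_head = 32, 32, 128
bytes_per_value = 2                      # fp16

# Standard MHA cache: keys + values for every head, per token, per layer.
mha_per_token = 2 * n_layers * n_heads * d_head * bytes_per_value
print(f"MHA cache per token:     {mha_per_token / 1024:.0f} KiB")       # 512 KiB

# A 92.19% reduction keeps roughly 7.81% of that.
reduced_per_token = mha_per_token * (1 - 0.9219)
print(f"Reduced cache per token: {reduced_per_token / 1024:.1f} KiB")   # ~40 KiB

# At a 4096-token context:
ctx = 4096
print(f"MHA:     {mha_per_token * ctx / 2**30:.2f} GiB")                # ~2.00 GiB
print(f"Reduced: {reduced_per_token * ctx / 2**30:.2f} GiB")            # ~0.16 GiB
```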