Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Latent Multi-Head Attention for Small Language Models

Created by
  • Haebom

Authors

Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

Outline

This paper presents the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing an interesting trade-off between efficiency and quality. We train 30-million-parameter GPT models on a dataset of 100,000 synthetic stories and benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary position embeddings (MLA+RoPE). Our main result is that MLA+RoPE with a half-rank latent dimension (r = d/2) reduces KV-cache memory usage by 45% while increasing validation loss by only 0.3% (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We also show that RoPE is crucial for MLA in small models: without RoPE, MLA performs 3-5% worse than standard attention, while with RoPE it performs 2% better. Inference benchmarks on NVIDIA A100 GPUs show that MLA with r = d/2 achieves a 1.4x speedup over full-rank MLA while maintaining the memory savings. In GPT-4 evaluation, MLA+RoPE achieves the highest quality score (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be made public upon acceptance.
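To make the mechanism concrete, here is a minimal PyTorch sketch of the latent-attention idea, not the authors' implementation: hidden states are down-projected into a shared latent of dimension r, only that latent is cached, and keys and values are up-projected from it at attention time. The class and names (LatentAttention, kv_down, k_up, v_up) and the dimensions are illustrative assumptions, and RoPE handling is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAttention(nn.Module):
    # Illustrative sketch, not the paper's code: cache one shared latent
    # vector of size r per token instead of full per-head keys and values.
    def __init__(self, d_model: int = 512, n_heads: int = 8, r: int = 256):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, r, bias=False)  # down-projection; its output is cached
        self.k_up = nn.Linear(r, d_model, bias=False)     # latent -> keys
        self.v_up = nn.Linear(r, d_model, bias=False)     # latent -> values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        B, T, _ = x.shape
        new_latent = self.kv_down(x)  # (B, T, r): the only KV state that needs caching
        latent = new_latent if latent_cache is None else torch.cat([latent_cache, new_latent], dim=1)
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # Causal mask during prefill; during cached decoding the new query
        # is allowed to attend to every cached position.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), latent  # return latent as the updated cache
```

Because only the (B, T, r) latent is stored instead of separate full-width keys and values, shrinking r directly shrinks the KV cache; the exact 45% figure above depends on implementation details (such as how RoPE keys are handled) that this sketch does not reconstruct.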

Takeaways, Limitations

Takeaways:
We present an MLA+RoPE architecture that simultaneously improves memory efficiency and performance in small language models.
MLA+RoPE with a half-rank latent dimension (r = d/2) reduces memory usage by 45% without performance degradation.
We confirm that RoPE is essential for improving MLA performance in compact models (see the sketch after this list).
MLA with r = d/2 also improves inference speed (1.4x over full-rank MLA).
MLA+RoPE achieves the highest quality score (7.4/10) in GPT-4 evaluation.
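For reference on the RoPE takeaway above, the following is a minimal, self-contained sketch of rotary position embeddings in the same illustrative PyTorch style; the split-half pairing is one common convention (GPT-NeoX style) and may differ from the paper's implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq_len, head_dim); head_dim must be even.
    B, H, T, D = x.shape
    half = D // 2
    # One frequency per rotation pair, as in the RoFormer paper.
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(T, dtype=torch.float32, device=x.device)[:, None] * freqs  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) coordinate pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In an MLA+RoPE variant, such a rotation would be applied to the queries and the up-projected keys before the attention product.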
Limitations:
Because the experiments use a synthetic dataset of 100,000 stories, generalization to real-world datasets requires further validation.
The code and model are not yet public.
The experiments were limited to a single model size (30 million parameters), so generalizability to models of other sizes requires further study.