Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Created by
  • Haebom

Authors

Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Outline

This paper introduces Compressed Convolutional Attention (CCA), an attention method designed to reduce the training and serving costs of long-context transformers. CCA down-projects queries, keys, and values into a shared latent space and performs the full attention operation there, which simultaneously reduces parameter count, KV-cache size, and FLOPs. The authors further propose Compressed Convolutional Grouped Query Attention (CCGQA), which combines CCA with GQA-style head sharing to further improve compute and bandwidth efficiency. Experiments show that CCGQA outperforms both GQA and MLA, and in MoE models achieves 8x KV-cache compression relative to standard MHA without performance degradation.
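For intuition, here is a minimal PyTorch sketch of the core idea: computing multi-head attention entirely in a down-projected latent space. It is not the authors' exact CCA (the convolutional component is omitted, and all dimensions and module names are illustrative assumptions, not taken from the paper).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedLatentAttention(nn.Module):
    """Sketch only: multi-head attention computed in a shared
    down-projected latent space. The paper's CCA additionally
    applies convolutions, which are omitted here."""

    def __init__(self, d_model=1024, d_latent=256, n_heads=8):
        super().__init__()
        assert d_latent % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_latent // n_heads
        # Q, K, V are all projected down to d_latent, so cached K/V
        # live in the smaller latent space (d_latent vs. d_model).
        self.q_proj = nn.Linear(d_model, d_latent, bias=False)
        self.k_proj = nn.Linear(d_model, d_latent, bias=False)
        self.v_proj = nn.Linear(d_model, d_latent, bias=False)
        self.out_proj = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Down-project into the compressed latent space, then split heads.
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Full attention, but every matmul runs at latent width, so
        # FLOPs and cached K/V shrink by roughly d_model / d_latent.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(o)

if __name__ == "__main__":
    x = torch.randn(2, 128, 1024)
    print(CompressedLatentAttention()(x).shape)  # torch.Size([2, 128, 1024])
```

CCGQA would additionally share each K/V head across a group of query heads, as in GQA, shrinking the cached K/V further.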

Takeaways, Limitations

Takeaways:
CCA and CCGQA reduce the compute, parameter count, and KV-cache size of long-context transformers, improving both training and inference speed (a rough cache-size calculation follows this list).
CCGQA outperforms GQA and MLA, and is also effective in MoE models.
CCA/CCGQA show significant prefill and backward-pass speedups over MHA on H100 GPUs.
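To make the cache-size takeaway concrete, a back-of-the-envelope calculation; the layer count, sequence length, and widths below are purely illustrative assumptions, not values from the paper:

```python
# Hypothetical dimensions, chosen only for illustration.
layers, batch, seq_len = 32, 1, 32_768
d_model = 4096
bytes_fp16 = 2

# Standard MHA caches full-width K and V for every layer and token.
mha_cache = layers * batch * seq_len * 2 * d_model * bytes_fp16

# An 8x-compressed cache stores K and V at 1/8 the width.
cca_cache = mha_cache // 8

print(f"MHA KV-cache:  {mha_cache / 2**30:.1f} GiB")  # 16.0 GiB
print(f"8x compressed: {cca_cache / 2**30:.1f} GiB")  # 2.0 GiB
```

At long contexts the KV-cache dominates inference memory, so a reduction of this size translates directly into longer contexts or larger batches per GPU.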
Limitations:
The paper does not specify an upper limit on the KV-cache compression ratio achievable without performance degradation.
Further research is needed on how well the method generalizes to other models and datasets.
The complexity of the method and the difficulty of implementing it are not discussed.