Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Created by
  • Haebom

Author

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

Outline

This paper addresses the long time-to-first-token (TTFT) latency caused by the quadratic complexity of vanilla attention in large language models (LLMs) that support very long context windows. Whereas existing approaches typically require additional pretraining or fine-tuning and often sacrifice model accuracy, this work provides theoretical and empirical evidence that near-lossless sparse attention is achievable, and highlights the importance of capturing head-specific sparse patterns dynamically and at low cost at runtime. To this end, the authors propose SampleAttention, an adaptive, structured, and near-lossless sparse attention method. SampleAttention exploits the observed sparse patterns in two ways: it attends to a fixed percentage of adjacent tokens to capture local-window patterns, and it applies a two-stage query-guided key-value filtering approach that adaptively selects a minimal key-value set at low cost to capture column-stripe patterns. Comprehensive evaluations show that SampleAttention can replace vanilla attention in off-the-shelf LLMs with almost no accuracy loss while reducing TTFT by up to 2.42x compared to FlashAttention.
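The sketch below illustrates the two sparse patterns described above, a local window over adjacent tokens plus query-guided selection of "column-stripe" key positions, as a single-head mask in plain PyTorch. It is not the authors' implementation (which relies on fused kernels and a two-stage filtering procedure); the function name `sample_attention_mask` and the parameters `window_ratio`, `sample_ratio`, and `keep_ratio` are hypothetical stand-ins for illustration only.

```python
# Minimal illustrative sketch of the two sparse patterns: a local window plus
# query-guided column stripes. Names and ratios are assumptions, not the
# paper's actual hyperparameters or API.
import torch

def sample_attention_mask(q, k, window_ratio=0.05, sample_ratio=0.02, keep_ratio=0.05):
    """Build a boolean [T, T] sparse-attention mask for one head.

    q, k: [T, d] query/key matrices for a single head.
    window_ratio: fraction of adjacent tokens kept as a local window.
    sample_ratio: fraction of queries sampled to score key columns.
    keep_ratio:   fraction of key columns kept as column stripes.
    """
    T, d = q.shape
    device = q.device

    # 1) Local-window pattern: each query attends to itself and the
    #    preceding `window` tokens (a fixed percentage of the sequence).
    window = max(1, int(window_ratio * T))
    idx = torch.arange(T, device=device)
    causal = idx[None, :] <= idx[:, None]
    local = causal & (idx[:, None] - idx[None, :] < window)

    # 2) Column-stripe pattern: score a small sample of queries against all
    #    keys, then keep the key columns with the largest aggregate attention
    #    mass (a stand-in for the paper's two-stage query-guided KV filtering).
    n_sample = max(1, int(sample_ratio * T))
    sample_idx = torch.randint(0, T, (n_sample,), device=device)
    scores = (q[sample_idx] @ k.T) / d ** 0.5          # [n_sample, T]
    col_mass = scores.softmax(dim=-1).sum(dim=0)       # [T]
    n_keep = max(1, int(keep_ratio * T))
    stripe_cols = col_mass.topk(n_keep).indices
    stripes = torch.zeros(T, T, dtype=torch.bool, device=device)
    stripes[:, stripe_cols] = True

    # Union of both patterns, restricted to the causal region.
    return (local | stripes) & causal


if __name__ == "__main__":
    T, d = 512, 64
    q, k, v = (torch.randn(T, d) for _ in range(3))
    mask = sample_attention_mask(q, k)
    attn = (q @ k.T) / d ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    out = attn @ v                                      # [T, d]
    print(out.shape, mask.float().mean())               # output shape and mask density
```

In a real system the mask would not be materialized densely; the point of the sketch is only to show how a cheap per-head sample of queries can pick the key columns to keep alongside the local window.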

Takeaways, Limitations

Takeaways:
  • Presents a novel sparse attention technique that effectively addresses the TTFT latency problem of LLMs with long context windows.
  • Applicable to existing LLMs without additional pretraining or fine-tuning.
  • Significantly reduces TTFT compared to FlashAttention with virtually no loss of accuracy.
  • Provides an efficient method for dynamically capturing per-head sparse patterns at runtime.
Limitations:
  • Further research is needed to determine how well SampleAttention's performance generalizes across different LLM architectures and context window sizes.
  • A more comprehensive comparative analysis with other advanced sparse attention techniques is needed.
  • Lacks performance evaluation for extremely long context windows.