This paper addresses the long Time-to-First-Token (TTFT) delays caused by the quadratic complexity of vanilla attention in large language models (LLMs) that support very long context windows. Whereas existing approaches require additional pretraining or fine-tuning and often sacrifice model accuracy, this work presents a near-lossless sparse attention method grounded in theoretical and empirical evidence. The authors highlight the importance of capturing head-specific sparse patterns dynamically and at low cost at runtime. To this end, they propose SampleAttention, an adaptive, structured, and near-lossless sparse attention approach. SampleAttention exploits the observed sparse patterns by attending to a fixed percentage of adjacent tokens to capture local window patterns, and it employs a two-stage, query-guided key-value filtering approach that adaptively selects a minimal set of key-values at low cost to capture column-stripe patterns. Comprehensive evaluations show that SampleAttention can replace vanilla attention in off-the-shelf LLMs with almost no accuracy loss, reducing TTFT by up to 2.42x compared to FlashAttention.
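To make the two sparsity patterns described above concrete, below is a minimal PyTorch sketch of the selection logic: a local window covering a fixed percentage of adjacent tokens, plus a two-stage, query-guided selection of column stripes. The function names, hyperparameter defaults (window_frac, sample_frac, keep_frac), and the dense-mask formulation are illustrative assumptions, not the paper's implementation; the actual speedup requires hardware-efficient sparse attention kernels rather than masking a full score matrix.

```python
import torch

def local_window_mask(seq_len: int, window_frac: float, device=None) -> torch.Tensor:
    """Boolean mask keeping, for each query, a fixed percentage of adjacent keys."""
    w = max(1, int(window_frac * seq_len))
    q_idx = torch.arange(seq_len, device=device).unsqueeze(1)
    k_idx = torch.arange(seq_len, device=device).unsqueeze(0)
    causal = k_idx <= q_idx            # causal constraint
    local = (q_idx - k_idx) < w        # within the local window
    return causal & local

def select_column_stripes(q, k, sample_frac=0.05, keep_frac=0.05):
    """Two-stage, query-guided key-value filtering (sketch, hypothetical defaults).

    Stage 1: score all keys using only a small sample of queries.
    Stage 2: keep the highest-scoring key columns ("stripes") for every query.
    """
    seq_len = q.shape[-2]
    n_sample = max(1, int(sample_frac * seq_len))
    sample_idx = torch.linspace(0, seq_len - 1, n_sample, device=q.device).long()
    q_sample = q[..., sample_idx, :]                            # (..., n_sample, d)
    scores = q_sample @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    col_importance = scores.softmax(dim=-1).sum(dim=-2)         # (..., seq_len)
    n_keep = max(1, int(keep_frac * seq_len))
    return col_importance.topk(n_keep, dim=-1).indices          # (..., n_keep)

def sample_attention(q, k, v, window_frac=0.02, sample_frac=0.05, keep_frac=0.05):
    """Combine local-window and column-stripe masks, then run masked attention."""
    seq_len, device = q.shape[-2], q.device
    mask = local_window_mask(seq_len, window_frac, device)      # (seq_len, seq_len)
    top_cols = select_column_stripes(q, k, sample_frac, keep_frac)
    stripe = torch.zeros(*q.shape[:-2], seq_len, device=device)
    stripe.scatter_(-1, top_cols, 1.0)                          # mark selected columns
    mask = mask | stripe.bool().unsqueeze(-2)                   # broadcast to all queries
    # Re-apply causality, since selected columns may lie in the future of some queries.
    causal = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
    mask = mask & causal
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Toy usage: batch=1, heads=2, seq_len=256, head_dim=64
q, k, v = (torch.randn(1, 2, 256, 64) for _ in range(3))
out = sample_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 256, 64])
```

Note that this sketch still materializes the full score matrix, so it only demonstrates how the head-specific sparse pattern is chosen at runtime, not how the reported TTFT reduction is obtained.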
Takeaways, Limitations
•
Takeaways:
◦
Presents a novel sparse attention technique that effectively addresses the TTFT delay problem of LLMs with long context windows.
◦
Applicable to existing LLMs without additional pretraining or fine-tuning.
◦
Significantly reduces TTFT compared to FlashAttention with virtually no loss of accuracy.
◦
Provides an efficient method for dynamically capturing head-specific sparse patterns at runtime.
•
Limitations:
◦
Further research is needed to determine how well SampleAttention's performance generalizes across different LLM architectures and context window sizes.
◦
A more comprehensive comparative analysis with other advanced sparse attention techniques is needed.
◦
Lack of performance evaluation for extremely long context windows.