Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Created by
  • Haebom

Authors

Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali

Outline

To address the high computational cost of excessive token generation in large reasoning models, this paper proposes LessIsMore, a sparse attention mechanism that requires no training. Instead of relying on the conventional per-head, locally optimized token selection, LessIsMore aggregates token selections across attention heads using global attention patterns and combines them with recent contextual information to produce a unified cross-head token ranking that is reused by subsequent decoding layers. This removes the need to maintain a separate token subset for each head, improving both generalization and efficiency. Evaluations across diverse reasoning tasks and benchmarks show that LessIsMore achieves an average 1.1x decoding speedup over full attention while maintaining or improving accuracy. Moreover, by attending to roughly 2x fewer tokens without accuracy loss, it achieves a 1.13x end-to-end speedup over existing sparse attention methods.
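The selection step described above can be pictured with a minimal, hypothetical PyTorch sketch: attention scores are aggregated across heads into one global ranking, a window of the most recent tokens is always retained, and the resulting unified index set is shared by all heads. The function and parameter names (unified_token_selection, keep_budget, recent_window) and the mean-based aggregation are illustrative assumptions, not the authors' implementation.

```python
import torch

def unified_token_selection(attn_scores: torch.Tensor,
                            keep_budget: int,
                            recent_window: int) -> torch.Tensor:
    # attn_scores: [num_heads, seq_len] attention weights of the current
    # query over all past tokens, one row per head (hypothetical input shape).
    num_heads, seq_len = attn_scores.shape

    # 1) Aggregate scores across heads into one global ranking signal,
    #    instead of keeping a separate ranking per head.
    global_scores = attn_scores.mean(dim=0)          # [seq_len]

    # 2) Always retain the most recent tokens (local context).
    boundary = max(seq_len - recent_window, 0)
    recent_idx = torch.arange(boundary, seq_len)

    # 3) Spend the remaining budget on the globally top-ranked older tokens.
    remaining = max(keep_budget - recent_idx.numel(), 0)
    older_scores = global_scores[:boundary]
    k = min(remaining, older_scores.numel())
    top_older = torch.topk(older_scores, k=k).indices

    # 4) One unified, sorted token set shared by all heads (and, per the
    #    paper's description, reused by subsequent decoding layers).
    return torch.unique(torch.cat([top_older, recent_idx]))
```

During sparse decoding, only the key/value entries at the returned positions would be gathered for attention; keep_budget and recent_window here are illustrative hyperparameters, not values taken from the paper.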

Takeaways, Limitations

Takeaways:
  • Demonstrates that a training-free sparse attention mechanism can effectively reduce the computational cost of large reasoning models.
  • Shows that the accuracy degradation of existing sparse attention methods can be avoided, with accuracy maintained or even improved alongside faster decoding.
  • A unified token selection method leveraging global attention patterns contributes to improved generalization.
Limitations:
  • The reported results may be limited to the specific benchmarks and tasks evaluated; validation in more diverse settings is needed.
  • The performance gains may be more pronounced for certain types of reasoning tasks and are not guaranteed to transfer equally to all of them.
  • Potential performance degradation over very long reasoning sequences warrants further study.