Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Long-Context Generalization with Sparse Attention

Created by
  • Haebom

Authors

Pavlo Vasylenko, Marcos Treviso, Andre FT Martins

Outline

This paper points out the limitations of softmax attention in Transformer-based architectures and proposes a method to improve on it. Softmax spreads probability mass over all tokens, which is ill-suited to tasks that require precise attention to fixed-size patterns: in long sequences, uninformative tokens accumulate attention mass, causing the attention distribution to disperse and representations to collapse. The paper shows that a sparse attention mechanism based on α-entmax can alleviate these problems, since α-entmax can assign exactly zero probability to irrelevant tokens. Building on this, the authors propose Adaptive-Scalable Entmax (ASEntmax), which uses a learnable temperature parameter to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. They further show that the ability to locate and generalize fixed-size patterns can be improved through appropriate positional encoding design. Integrating ASEntmax and suitable positional encodings into standard Transformer layers yields better long-context generalization than softmax, scalable softmax, and fixed-temperature α-entmax baselines.
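To make the contrast between softmax and a sparse normalizer concrete, below is a minimal PyTorch sketch (not taken from the paper) that uses sparsemax, the closed-form α = 2 case of α-entmax, together with a learnable logit scale to move between sparser and denser attention. The class and parameter names (EntmaxLikeAttention, log_scale) are illustrative assumptions; the authors' ASEntmax uses general α-entmax with its own temperature design, and the DeepSPIN entmax package provides faithful α-entmax implementations.

```python
import torch

def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (the alpha=2 case of alpha-entmax): Euclidean projection of the
    scores onto the probability simplex. Unlike softmax, it can assign exactly
    zero probability to low-scoring tokens."""
    z, _ = torch.sort(scores, dim=dim, descending=True)
    cumsum = z.cumsum(dim) - 1.0
    k = torch.arange(1, scores.size(dim) + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = (k * z > cumsum).to(scores.dtype)        # 1 where a token stays in the support
    support_size = support.sum(dim=dim, keepdim=True)  # number of nonzero probabilities
    tau = cumsum.gather(dim, support_size.long() - 1) / support_size
    return torch.clamp(scores - tau, min=0.0)

class EntmaxLikeAttention(torch.nn.Module):
    """Single-head self-attention with a learnable logit scale and a sparse
    normalizer; a rough analogue of the sparse/dense trade-off, not ASEntmax itself."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.log_scale = torch.nn.Parameter(torch.zeros(1))  # learnable temperature (assumed form)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        # A larger learned scale sharpens the distribution (more exact zeros,
        # pattern-focused); a smaller scale makes it denser and more softmax-like.
        probs = sparsemax(scores * self.log_scale.exp(), dim=-1)
        return probs @ v

# Example: many attention weights come out exactly zero, unlike with softmax.
x = torch.randn(2, 16, 32)
out = EntmaxLikeAttention(32)(x)   # shape (2, 16, 32)
```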

Takeaways, Limitations

Takeaways:
Demonstrates the feasibility of addressing attention dispersion and representation collapse in long sequences with an α-entmax-based sparse attention mechanism.
Enables flexible control between sparse and dense attention via ASEntmax.
Highlights the importance of proper positional encoding design and verifies the resulting performance gains.
Experimentally demonstrates improved performance over existing methods on long-context generalization tasks.
Limitations:
Further analysis is needed on learning the temperature parameters of ASEntmax.
Additional evaluation of the generalization performance of the proposed method on various tasks and datasets is needed.
A more general and systematic approach to positional encoding design is needed.