Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Trainable Dynamic Mask Sparse Attention

Created by
  • Haebom

Author

Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo

Outline

This paper proposes Dynamic Mask Attention (DMA), a trainable dynamic-mask sparse attention mechanism, to address the quadratic complexity of standard self-attention, which becomes a bottleneck as demand for long-context modeling grows. DMA exploits both content-aware and position-aware sparsity to reduce computational cost while minimizing information loss: content-aware sparse masks are generated dynamically from the value representations so attention concentrates on important information, while position-aware sparse attention skips unnecessary regions of the computation. Under the Chinchilla scaling-law setting, DMA achieves lower perplexity than multi-head attention, sliding-window attention, multi-head latent attention, and conventional sparse attention, and it shows superior performance and efficiency on multi-query associative recall tasks. Notably, in an evaluation of a 1.7-billion-parameter model, DMA outperforms multi-head attention on both standard benchmarks and the needle-in-a-haystack task.
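The two kinds of sparsity can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the gating projection (`gate_proj`), the `window` size, and the `keep_ratio` top-k selection are hypothetical stand-ins, assuming the content-aware mask is scored from the value vectors and the position-aware mask is a causal sliding window.

```python
import torch
import torch.nn as nn

def dynamic_mask_attention(q, k, v, gate_proj, window=64, keep_ratio=0.25):
    # q, k, v: (batch, heads, seq_len, head_dim)
    B, H, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5              # (B, H, T, T)

    # Content-aware sparsity: score each position from its value vector and
    # keep only the top-k "important" key/value columns (hypothetical gating).
    importance = gate_proj(v).squeeze(-1)                    # (B, H, T)
    k_keep = max(1, int(keep_ratio * T))
    top_idx = importance.topk(k_keep, dim=-1).indices        # (B, H, k_keep)
    content_mask = torch.zeros(B, H, T, dtype=torch.bool, device=q.device)
    content_mask.scatter_(-1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
    content_mask = content_mask.unsqueeze(2).expand(B, H, T, T)

    # Position-aware sparsity: a causal sliding window is always attended,
    # so regions outside the window and the kept columns can be skipped.
    idx = torch.arange(T, device=q.device)
    causal = idx[None, :] <= idx[:, None]                    # (T, T)
    local = (idx[:, None] - idx[None, :]) < window
    position_mask = (causal & local)[None, None]             # (1, 1, T, T)

    mask = (content_mask | position_mask) & causal[None, None]
    attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return attn @ v

# Toy usage: 2 heads, 128 tokens, 32-dim heads; the gate scores each value vector.
B, H, T, D = 1, 2, 128, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
gate = nn.Linear(D, 1, bias=False)
out = dynamic_mask_attention(q, k, v, gate)                  # (1, 2, 128, 32)
```

In the paper's method the mask is trainable end-to-end and the masked regions are skipped rather than materialized, which is where the efficiency gain comes from; the dense masked_fill above is only for readability.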

Takeaways, Limitations

Takeaways:
Presents DMA, a novel attention mechanism that dynamically exploits content- and position-aware sparsity.
Addresses the static patterns and information loss that limit existing sparse attention mechanisms.
Effectively balances computational efficiency and information fidelity.
Demonstrates superior performance and efficiency over existing attention mechanisms across various benchmark tasks.
Contributes significantly to improving the efficiency of long-context modeling.
Limitations:
DMA performance improvements may be limited to specific datasets or tasks.
Further analysis of the computational complexity of DMA's training and inference is needed.
Generalizability needs to be verified across a variety of model sizes and architectures.
Additional performance evaluation for extremely long contexts is needed.