Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Created by
  • Haebom

Author

Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang

Outline

SeerAttention-R is a sparse attention framework designed specifically for the long decoding phase of long reasoning models. It extends SeerAttention by removing query pooling to accommodate autoregressive decoding, while retaining the design of learning attention sparsity through a self-distillation gating mechanism. Because the gating is a lightweight plug-in, SeerAttention-R can be integrated into existing pre-trained models without modifying the original parameters. Trained on only 0.4B tokens, SeerAttention-R maintains near-lossless reasoning accuracy on the AIME benchmark under a 4K token budget, even with large sparse attention block sizes (64/128). Using TileLang, the authors develop a highly optimized sparse decoding kernel that reaches close to the theoretical speedup of up to 9x over FlashAttention-3 at 90% sparsity on an H100 GPU. The code is available at https://github.com/microsoft/SeerAttention .
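
The sketch below illustrates the core idea as described in the abstract: at each decoding step, a lightweight gate scores blocks of the cached keys, only the top-scoring blocks within the token budget are kept, and attention runs over the selected blocks. The mean-pooled block keys, the single linear gate projection, and the head-shared block selection are illustrative assumptions, not the paper's exact design or its TileLang kernel.

```python
import torch
import torch.nn.functional as F


def gated_block_sparse_decode(q, k_cache, v_cache, gate_proj,
                              block_size=64, token_budget=4096):
    """One decoding step. q: [H, D]; k_cache, v_cache: [S, H, D]."""
    seq_len, num_heads, head_dim = k_cache.shape
    n_blocks = (seq_len + block_size - 1) // block_size
    pad = n_blocks * block_size - seq_len

    # Represent each key block by its mean-pooled key (illustrative choice),
    # then score the blocks with the lightweight gate projection.
    k_pad = F.pad(k_cache, (0, 0, 0, 0, 0, pad))
    block_keys = k_pad.view(n_blocks, block_size, num_heads, head_dim).mean(dim=1)
    gate_scores = torch.einsum('hd,nhd->hn', gate_proj(q), block_keys)  # [H, n_blocks]

    # Keep only the highest-scoring blocks that fit the token budget
    # (selection shared across heads here, purely for brevity).
    k_keep = min(n_blocks, max(1, token_budget // block_size))
    top_blocks = gate_scores.mean(dim=0).topk(k_keep).indices

    # Gather the selected blocks and run ordinary attention over them only.
    idx = (top_blocks[:, None] * block_size
           + torch.arange(block_size, device=q.device)).clamp_(max=seq_len - 1).flatten()
    k_sel, v_sel = k_cache[idx], v_cache[idx]
    attn = torch.einsum('hd,shd->hs', q, k_sel) / head_dim ** 0.5
    return torch.einsum('hs,shd->hd', attn.softmax(dim=-1), v_sel)
```

With, for example, gate_proj = torch.nn.Linear(head_dim, head_dim, bias=False), the per-step attention cost drops from the full cached sequence to roughly token_budget keys, which is where the speedup at high sparsity comes from.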

Takeaways, Limitations

Takeaways:
Presents an efficient sparse attention mechanism for long decoding in long reasoning models.
Designed as a lightweight plug-in that integrates into existing pre-trained models without modifying the original parameters (a minimal integration sketch appears at the end of this page).
Achieves near-lossless reasoning accuracy despite being trained on only 0.4B tokens.
Delivers real speedups through an optimized sparse decoding kernel written in TileLang.
Limitations:
Results are reported only on the AIME benchmark, so generalization to other benchmarks is uncertain.
Training used only 0.4B tokens, so there may be room for further improvement with a larger dataset.
The optimized kernel depends on TileLang, so the reported speedups may not carry over to other environments.
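
As a usage-oriented sketch of the plug-in claim above: the pretrained weights stay frozen and only a small gate module per attention layer is trained, for example by distilling the dense attention's block-level mass into the gate (one reading of "self-distillation gating"). The AttnGate class and attach_gates helper below are hypothetical names, not the paper's actual API or training recipe.

```python
import torch
import torch.nn as nn


class AttnGate(nn.Module):
    """Hypothetical lightweight gate attached next to a frozen attention layer."""

    def __init__(self, head_dim):
        super().__init__()
        self.q_proj = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, q, block_keys):
        # Score how much each cached key block matters for this query.
        return torch.einsum('hd,nhd->hn', self.q_proj(q), block_keys)


def attach_gates(model, num_layers, head_dim):
    # Freeze every original parameter; only the gates receive gradients,
    # so the base model is left completely unmodified.
    for p in model.parameters():
        p.requires_grad_(False)
    return nn.ModuleList(AttnGate(head_dim) for _ in range(num_layers))
```

Training would then minimize, per layer, a divergence (e.g. KL) between the gate's softmax over blocks and the dense attention weights summed within each block, so the gate learns which blocks the frozen model already attends to.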