Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Created by
  • Haebom

Author

Jiawei Lin, Guokai Chen, Yuanlong Li, Thomas Bourgeat

Outline

This paper proposes Flash Systolic Array (FSA), a novel systolic-array-based architecture for efficiently accelerating FlashAttention in Transformer models. Existing systolic-array accelerators suffer from low utilization and degraded performance because FlashAttention's matrix multiplications and softmax operations must be frequently interleaved. FSA introduces a scheduling algorithm called SystolicAttention that executes the entire FlashAttention computation within a single systolic array, overlapping matrix multiplication and softmax at a fine granularity without external vector units and thereby significantly improving array utilization. Implemented as synthesizable RTL, FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization than AWS Neuron v2 and Google TPUv5e, respectively, with only a 12% area overhead.
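For context on what gets interleaved, the sketch below shows the online-softmax tiling recurrence that FlashAttention is built on, i.e. the alternating matrix-multiply and softmax (max, exponent, rescale) steps that FSA overlaps inside a single systolic array. This is a minimal NumPy illustration of the standard FlashAttention math, not the paper's FSA scheduling or RTL; the function name, tile size, and the correctness check are assumptions for the example.

```python
import numpy as np

def flash_attention_reference(Q, K, V, block_size=64):
    """Tiled attention using the FlashAttention online-softmax recurrence.
    Illustrative NumPy sketch only; block_size and names are assumptions."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)              # running (unnormalized) output
    m = np.full(n, -np.inf)           # running row-wise max of the scores
    l = np.zeros(n)                   # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                    # matmul step
        m_new = np.maximum(m, S.max(axis=1))      # softmax: update running max
        P = np.exp(S - m_new[:, None])            # softmax: exponentiate tile
        correction = np.exp(m - m_new)            # rescale previous partials
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb      # second matmul step
        m = m_new
    return O / l[:, None]

# Check against naive softmax(QK^T / sqrt(d)) V
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((128, 32))
V = rng.standard_normal((128, 32))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), ref, atol=1e-6)
```

In each tile, a matmul (QK^T), several element-wise softmax steps, and another matmul (PV) follow one another; it is this per-tile alternation that, per the paper, FSA keeps entirely on the systolic array instead of handing the softmax steps to external vector units.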

Takeaways, Limitations

Takeaways:
• Executing the entire FlashAttention algorithm within a single systolic array addresses the performance degradation of existing architectures.
• The SystolicAttention scheduling algorithm enables fine-grained overlap of matrix multiplication and softmax operations, achieving high array utilization.
• FSA delivers significantly higher attention FLOPs/s utilization than AWS Neuron v2 and Google TPUv5e, suggesting a competitive hardware accelerator design.
• The performance gains come at only a 12% area overhead, indicating an economical design.
Limitations:
• The reported gains are relative to specific hardware platforms (AWS Neuron v2, Google TPUv5e); performance against other platforms requires additional verification.
• FSA's effectiveness depends heavily on the SystolicAttention scheduling algorithm, and its generalization to inputs of various sizes and shapes requires further study.
• The paper lacks an energy-efficiency analysis; the higher performance may come with increased power consumption.