The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts
Created by
Haebom
Author
Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn
Outline
This paper observes that the workload of a conventional Transformer splits into two regimes: memory-bound Multi-Head Attention (MHA) and compute-bound feed-forward layers. That split has driven a line of specialized-hardware research aimed at relieving the MHA bottleneck. The paper argues, however, that recent architectural shifts such as Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) reduce the need for dedicated attention hardware. It shows that the arithmetic intensity of MLA is roughly two orders of magnitude higher than that of MHA, allowing it to run efficiently on modern accelerators such as GPUs, and that MoE layers can match the intensity of dense feed-forward layers by distributing experts across a pool of accelerators and tuning the per-expert batch size. The central challenge for next-generation Transformer systems is therefore no longer accelerating a single memory-bound layer, but designing a balanced system with enough compute, memory capacity, memory bandwidth, and high-bandwidth interconnect to handle the diverse demands of large models.
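To make the intensity argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under assumed, DeepSeek-style dimensions (128 heads, head size 128, a 512+64-dimensional latent cache, 7168x2048 expert FFNs), not the paper's exact model: per-request MHA decode streams a per-head KV cache and stays near 1 FLOP/byte, MLA's compressed latent cache is reused by every head, and an MoE expert's intensity rises with the number of tokens batched onto it.

```python
# Back-of-the-envelope arithmetic-intensity estimates (FLOPs per byte of
# KV-cache / weight traffic) during decode. Simplified illustration only;
# the dimensions below are assumptions, not taken from the paper.

BYTES = 2  # fp16/bf16 storage

def mha_decode_intensity(n_heads=128, d_head=128, seq_len=4096):
    """MHA decode for one request: every head streams its own K and V rows."""
    flops = n_heads * seq_len * (2 * d_head + 2 * d_head)   # q.K^T and attn.V
    bytes_read = n_heads * seq_len * 2 * d_head * BYTES     # K and V caches
    return flops / bytes_read

def mla_decode_intensity(n_heads=128, d_latent=512, d_rope=64, seq_len=4096):
    """MLA decode with absorbed projections: all heads reuse one compressed
    latent cache entry per token, so each cached byte feeds every head."""
    d_c = d_latent + d_rope
    flops = n_heads * seq_len * (2 * d_c + 2 * d_latent)    # scores and outputs
    bytes_read = seq_len * d_c * BYTES                      # latent cache read once
    return flops / bytes_read

def moe_expert_intensity(tokens_per_expert, d_model=7168, d_ff=2048):
    """One expert's FFN GEMMs: weights are read once per batch, so intensity
    grows roughly linearly with the tokens routed to the expert."""
    flops = tokens_per_expert * 2 * (2 * d_model * d_ff)    # up- and down-projection
    bytes_read = 2 * d_model * d_ff * BYTES + tokens_per_expert * d_model * BYTES
    return flops / bytes_read

print(f"MHA decode: ~{mha_decode_intensity():.1f} FLOPs/byte (memory-bound)")
print(f"MLA decode: ~{mla_decode_intensity():.0f} FLOPs/byte")
print(f"MoE expert, 1 token/expert:    ~{moe_expert_intensity(1):.1f} FLOPs/byte")
print(f"MoE expert, 256 tokens/expert: ~{moe_expert_intensity(256):.0f} FLOPs/byte")
```

Under these assumed sizes the sketch lands near 1 FLOP/byte for MHA versus a few hundred for MLA, and shows an MoE expert moving from roughly 1 to a few hundred FLOPs/byte as its batch grows, which is the sense in which these layers become compute-friendly on GPUs.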
Takeaways, Limitations
•
Takeaways:
◦
The introduction of MLA and MoE offers an architectural way to alleviate the memory bottleneck of conventional Transformers.
◦
The analysis suggests a reduced need for specialized attention-acceleration hardware.
◦
Next-generation Transformer design should shift toward balanced systems that weigh compute, memory capacity, memory bandwidth, and interconnect bandwidth together.
•
Limitations:
◦
The effectiveness of MLA and MoE may depend on the specific model and dataset, so they may not be a universally applicable solution.
◦
The paper's conclusions rest on theoretical analysis and still require validation through real implementations and performance measurements.
◦
It offers few concrete guidelines or architectural proposals for the balanced system design it advocates.