Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Created by
  • Haebom

Authors

Bo Gao, Michael W. Spratling

Outline

This paper proposes a novel attention mechanism to address the numerical instability and performance degradation that conventional softmax attention suffers at long inference token lengths. The authors decompose the softmax operation into a nonlinear positive transformation followed by an $l_1$-normalization step, and show that the $l_1$ normalization is essential for maintaining model performance. In the first step, they replace the exponential with the numerically stable softplus activation function and introduce a dynamic, length-dependent scaling factor designed to keep the attention entropy invariant, which already outperforms conventional softmax attention. In the second step, they add a re-weighting mechanism that sharpens the attention distribution, amplifying important weights and suppressing weak ones so that attention focuses more effectively on relevant tokens. Combining the two steps ensures numerical stability and yields strong results on long-context extraction tasks and standard downstream benchmarks, while the validation loss remains nearly constant even at 16x the training length, dramatically improving length extrapolation.
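
The two-step structure described above can be sketched in a few lines of code. The snippet below is an illustrative re-implementation of the idea only, not the authors' released code: the log-length scale factor and the sharpening exponent gamma are stand-in assumptions for the paper's entropy-invariant scaling and re-weighting rules.

```python
# Illustrative sketch (not the authors' implementation): softplus attention
# with l1 normalization and a power-based re-weighting step. The log-length
# scale and the exponent `gamma` are stand-in assumptions, not the exact
# entropy-invariant scale or re-weighting rule defined in the paper.
import math
import torch
import torch.nn.functional as F

def softplus_attention(q, k, v, gamma: float = 2.0, eps: float = 1e-6):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d = q.size(-1)
    n = k.size(-2)

    # Step 1: nonlinear positive transformation.
    # Softplus replaces exp; a length-dependent scale (assumed here to grow
    # with log n) stands in for the paper's entropy-invariant scaling factor.
    scale = math.log(n) / math.sqrt(d)
    scores = F.softplus(torch.matmul(q, k.transpose(-2, -1)) * scale)

    # l1 normalization over the key dimension (the second half of softmax).
    weights = scores / (scores.sum(dim=-1, keepdim=True) + eps)

    # Step 2: re-weighting to sharpen the distribution. Raising the weights
    # to a power gamma > 1 amplifies large entries and suppresses small ones,
    # after which the l1 normalization is applied again.
    weights = weights ** gamma
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)

    return torch.matmul(weights, v)

# Example call on random tensors.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = softplus_attention(q, k, v)  # -> shape (1, 8, 128, 64)
```

Because softplus is bounded below by zero and grows linearly rather than exponentially, the scores stay numerically stable at long sequence lengths, while the re-weighting step restores the sharp focus that the exponential in softmax would otherwise provide.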

Takeaways, Limitations

Takeaways:
Presents an effective solution to the numerical instability and poor long-context performance of softmax attention.
Improves the attention mechanism through a softplus activation function, an entropy-invariant dynamic scaling factor, and a re-weighting mechanism.
Achieves strong performance on long-context extraction tasks and standard downstream benchmarks.
Maintains stable performance in contexts up to 16 times longer than the training length.
Limitations:
Analysis of the computational complexity of the proposed method may be lacking.
Further experimental results on various types of long context datasets may be needed.
Further research may be needed to determine the generalization performance of the proposed method.