This paper identifies limitations of softmax-based attention in Transformer architectures and proposes a method to address them. Softmax assigns nonzero probability to every token, which is ill-suited to tasks that require precise attention to fixed-size patterns. In long sequences, uninformative tokens accumulate attention probability mass, increasing variance and leading to representation collapse. We show that a sparse attention mechanism based on α-entmax can mitigate these problems, since α-entmax can assign exactly zero probability to irrelevant tokens. We further propose Adaptive-Scalable Entmax (ASEntmax), which equips α-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved by a careful design of the positional encoding. By integrating ASEntmax and suitable positional encodings into a standard Transformer layer, our models outperform softmax, scalable softmax, and fixed-temperature α-entmax baselines on long-context generalization.
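To make the mechanism concrete, below is a minimal sketch (not the paper's reference implementation) of α-entmax attention with a learnable temperature in PyTorch. The bisection routine, the `SparseAttention` module, and the `exp(log_temp)` temperature parameterization are illustrative assumptions and may differ from the exact ASEntmax formulation; the key points it demonstrates are that α-entmax can produce exact zeros and that a learnable temperature can shift the distribution between sparser and denser (softmax-like) behavior.

```python
# Minimal sketch, assuming a single attention head and a bisection-based
# alpha-entmax; names and the temperature parameterization are illustrative.
import torch
import torch.nn as nn


def entmax_bisect(z: torch.Tensor, alpha: float = 1.5, n_iter: int = 50) -> torch.Tensor:
    """alpha-entmax over the last dim: p_j = [(alpha-1) z_j - tau]_+^{1/(alpha-1)},
    with the threshold tau found by bisection so that p sums to 1.
    Unlike softmax, entries below the threshold get exactly zero probability."""
    assert alpha > 1.0, "alpha > 1 yields sparse outputs; alpha -> 1 recovers softmax"
    z = (alpha - 1.0) * z
    tau_hi = z.max(dim=-1, keepdim=True).values   # at this tau, total mass is 0
    tau_lo = tau_hi - 1.0                         # at this tau, total mass is >= 1
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = torch.clamp(z - tau, min=0.0) ** (1.0 / (alpha - 1.0))
        mass = p.sum(dim=-1, keepdim=True)
        tau_lo = torch.where(mass >= 1.0, tau, tau_lo)  # root lies at a larger tau
        tau_hi = torch.where(mass < 1.0, tau, tau_hi)   # root lies at a smaller tau
    p = torch.clamp(z - tau_lo, min=0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum(dim=-1, keepdim=True)


class SparseAttention(nn.Module):
    """Single-head attention whose scores pass through alpha-entmax with a
    learnable (log-)temperature, so training can move between sharper (sparse)
    and flatter (dense, softmax-like) attention distributions."""

    def __init__(self, d_model: int, alpha: float = 1.5):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.log_temp = nn.Parameter(torch.zeros(1))  # temperature = exp(log_temp)
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        scores = scores / self.log_temp.exp()         # learnable temperature scaling
        return entmax_bisect(scores, alpha=self.alpha) @ v


# Usage: run a toy batch and inspect how many attention weights are exactly zero.
attn = SparseAttention(d_model=16)
out = attn(torch.randn(2, 8, 16))
```

In this sketch, a small learned temperature sharpens the scores before α-entmax, pushing more weights to exact zero, while a large temperature flattens them toward a dense, softmax-like distribution; a production implementation would typically use the optimized `entmax` package and a multi-head layer.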