In this paper, we propose AttentionSpan, a new benchmark for evaluating how reliably pretrained language models perform algorithmic tasks, in particular whether they maintain their performance on previously unseen input/output domains. AttentionSpan consists of five tasks with infinite input domains, designed to distinguish algorithmic understanding from memorization. These tasks allow us to evaluate both a model's ability to generalize to unseen inputs, including novel lengths, value ranges, and domains, and the robustness of its learned mechanisms. Through attention-map analysis and targeted interventions, we show that the attention mechanism is directly responsible for generalization failures. Implementations of all tasks and interpretability methods are publicly available.
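
To make the kind of analysis mentioned above concrete, the sketch below shows one common way to extract attention maps and run a simple attention ablation with the Hugging Face transformers library. The model ("gpt2"), prompt, and layer index are illustrative assumptions for the sketch, not the paper's actual experimental setup.

```python
# A minimal, illustrative sketch of attention-map extraction and a targeted
# intervention on a pretrained model; the model ("gpt2"), prompt, and layer
# choice are assumptions for illustration, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical algorithmic-style prompt (not one of the benchmark tasks).
inputs = tokenizer("copy: a b c d -> a b c", return_tensors="pt")

# 1) Attention-map analysis: ask the model to return per-layer attention weights.
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
# out.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
print("layers:", len(out.attentions), "| last-layer map:", out.attentions[-1].shape)

# 2) Targeted intervention: zero the attention block's output at one layer,
#    removing its contribution to the residual stream for this forward pass.
LAYER = 5  # arbitrary layer choice for the sketch

def zero_attention_output(module, args, output):
    # GPT-2's attention block returns a tuple; output[0] is the attention
    # output of shape (batch, seq_len, hidden_dim).
    return (torch.zeros_like(output[0]),) + tuple(output[1:])

handle = model.transformer.h[LAYER].attn.register_forward_hook(zero_attention_output)
with torch.no_grad():
    ablated = model(**inputs)
handle.remove()

# Compare next-token predictions before and after the intervention.
orig = tokenizer.decode(out.logits[0, -1].argmax().item())
abl = tokenizer.decode(ablated.logits[0, -1].argmax().item())
print("original next token:", repr(orig), "| ablated next token:", repr(abl))
```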