This paper studies the "outlier dimension" in the final layer of language models, a dimension that exhibits extreme activations for the majority of inputs. We show that this outlier dimension arises in a range of state-of-the-art language models and that it implements a heuristic that consistently predicts frequent words. We further show that this heuristic can be counteracted through the weights assigned to the remaining dimensions when its prediction is inappropriate for the context. We also investigate how the outlier dimension changes as the number of model parameters increases and at what point during training it emerges. We conclude that the outlier dimension is a specialized mechanism that many models discover in order to implement a useful token-prediction heuristic.
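To make the notion of an outlier dimension concrete, the following is a minimal sketch (not the paper's procedure) of one way to locate a candidate outlier dimension: compute the mean absolute activation per hidden dimension in the final layer over a few inputs and flag the dimension with an extreme value. The model choice (`gpt2`) and the z-score criterion are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: any causal LM with accessible hidden states would do here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Language models often develop unusual internal structure.",
]

with torch.no_grad():
    acts = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Final-layer hidden states: shape (1, seq_len, hidden_dim)
        final = outputs.hidden_states[-1].squeeze(0)
        # Mean |activation| per hidden dimension for this input
        acts.append(final.abs().mean(dim=0))

mean_abs = torch.stack(acts).mean(dim=0)           # (hidden_dim,)
z = (mean_abs - mean_abs.mean()) / mean_abs.std()  # z-score per dimension
outlier = torch.argmax(z).item()
print(f"Candidate outlier dimension: {outlier} (z-score {z[outlier]:.1f})")
```

In practice one would average over a larger corpus and check that the same dimension is extreme for most inputs, which is the defining property described above.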