This paper highlights that existing deep learning vision models are typically tailored to a single modality and rely on domain-specific assumptions, such as the regular lattice structure underlying most architectures. To overcome this, we propose Adaptive Superpixel Coding (ASC), a Transformer-based self-supervised learning model. ASC addresses a key limitation of standard Vision Transformers, namely their reliance on fixed-size, content-agnostic patch tokenization, by replacing it with an adaptive superpixel layer whose regions adjust dynamically to the image content. We analyze the key properties that make this method effective and demonstrate that it outperforms widely used methods on standard image downstream task benchmarks.
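To make the contrast with fixed-size patch tokenization concrete, the following is a minimal, hypothetical sketch of content-adaptive superpixel tokenization; it is not the paper's ASC implementation. It uses off-the-shelf SLIC superpixels (scikit-image) as a stand-in for the learned adaptive superpixel layer, mean-pools per-pixel features within each superpixel to form one token per region, and feeds the variable-length token sequence to a standard Transformer encoder. All names and hyperparameters (SuperpixelTokenizer, embed_dim, n_segments) are illustrative assumptions.

```python
# Minimal sketch: superpixel tokens instead of fixed-size ViT patches.
# Assumes SLIC as a proxy for the paper's adaptive superpixel layer.
import torch
import torch.nn as nn
from skimage.segmentation import slic


class SuperpixelTokenizer(nn.Module):
    """Turn an image into one token per superpixel rather than per fixed patch."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 256, n_segments: int = 64):
        super().__init__()
        self.n_segments = n_segments
        # +2 input features for normalized (y, x) pixel coordinates.
        self.pixel_embed = nn.Linear(in_channels + 2, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (H, W, C) float tensor in [0, 1]
        h, w, _ = image.shape

        # Content-adaptive segmentation: region boundaries follow image structure,
        # unlike the rigid grid of fixed-size patches.
        labels = slic(image.numpy(), n_segments=self.n_segments, compactness=10.0)
        labels = torch.from_numpy(labels).long()

        # Per-pixel features: colour channels plus normalized coordinates.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys / h, xs / w], dim=-1).float()
        feats = self.pixel_embed(torch.cat([image, coords], dim=-1))  # (H, W, D)

        # Mean-pool features within each superpixel -> one token per region.
        tokens = [feats[labels == sp].mean(dim=0) for sp in labels.unique()]
        return torch.stack(tokens)  # (num_superpixels, D)


# Usage: tokenize an image and encode the adaptive tokens with a Transformer.
tokenizer = SuperpixelTokenizer(embed_dim=256, n_segments=64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
image = torch.rand(224, 224, 3)
tokens = tokenizer(image).unsqueeze(0)   # (1, num_superpixels, 256)
encoded = encoder(tokens)                # (1, num_superpixels, 256)
```

In this sketch the number and shape of tokens vary with image content, which is the property the abstract attributes to the adaptive superpixel layer; the paper's actual layer and self-supervised objective may differ substantially.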