This paper argues that while existing vision models rely primarily on local texture information and therefore produce weak, non-compositional features, humans recognize objects using both local texture and the composition of object parts. Existing studies of shape vs. texture bias have framed shape and texture representations as adversarial, measuring shape only relative to texture, and thus overlook both the possibility that models (and humans) exploit the two cues simultaneously and the absolute quality of each representation. In this paper, we reframe shape assessment as an absolute question of compositional ability and operationalize it with the Compositional Shape Score (CSS). CSS measures the ability to recognize both images of an object-anagram pair: images that depict different object categories while preserving local texture and changing only the global arrangement of parts. Analyzing 86 convolutional, transformer, and hybrid models, CSS reveals a broad range of compositional sensitivity, with self-supervised and language-aligned transformers such as DINOv2, SigLIP2, and EVA-CLIP occupying the top of the CSS spectrum. Mechanistic analyses show that high-CSS networks depend on long-range interactions: radius-controlled attention masks sharply degrade performance and yield a distinctive U-shaped integration profile, while representational similarity analysis exposes an intermediate-depth transition from local to global coding. A BagNet control remains at chance level, ruling out "edge-hacking" strategies. Finally, CSS also predicts performance on other shape-dependent evaluations. We conclude that the path toward truly robust, generalizable, and human-like vision systems may lie in architectures and learning frameworks that seamlessly integrate local texture and global compositional shape, rather than forcing an artificial choice between shape and texture.
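To make the evaluation protocol concrete, the sketch below shows one way a CSS-style score could be computed for a generic image classifier. It is a minimal illustration under stated assumptions, not the paper's released implementation: the `AnagramPair` container, the tensor shapes, and the pair-level scoring rule (both rearrangements must be classified into their respective categories) are hypothetical choices made for clarity.

```python
# Minimal sketch of a CSS-style evaluation for a PyTorch classifier.
# All names here (AnagramPair, compositional_shape_score) are illustrative
# assumptions, not identifiers from the paper's code release.
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class AnagramPair:
    """Two images built from the same local parts, arranged as different objects."""
    image_a: torch.Tensor  # (3, H, W), depicts category `label_a`
    image_b: torch.Tensor  # (3, H, W), same local texture, depicts `label_b`
    label_a: int
    label_b: int


@torch.no_grad()
def compositional_shape_score(
    model: Callable[[torch.Tensor], torch.Tensor],  # maps (N, 3, H, W) -> (N, num_classes) logits
    pairs: List[AnagramPair],
) -> float:
    """Fraction of anagram pairs where both rearrangements are classified correctly.

    Because local texture statistics are shared within a pair, getting both
    members right requires sensitivity to the global arrangement of parts.
    """
    hits = 0
    for p in pairs:
        batch = torch.stack([p.image_a, p.image_b])  # (2, 3, H, W)
        preds = model(batch).argmax(dim=-1)           # predicted class ids
        hits += int(preds[0] == p.label_a and preds[1] == p.label_b)
    return hits / len(pairs)
```

Scoring at the pair level (rather than per image) is what makes the measure absolute: a texture-only model can at best match one member of each pair, so chance-level CSS is the expected outcome for purely local strategies.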