Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Created by
  • Haebom

Authors

Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez

Outline

This paper argues that existing vision models rely primarily on local texture information and therefore build brittle, non-compositional features, whereas humans recognize objects using both local texture and the arrangement of object parts. Prior studies of shape-vs.-texture bias pit the two cues against each other, measuring shape only relative to texture; this overlooks both the possibility that models (and humans) can exploit both types of cue simultaneously and the absolute quality of each representation. The paper therefore reframes shape assessment as an absolute test of compositional ability, implemented as the Compositional Shape Score (CSS). CSS measures how well a model recognizes object-anagram pairs: image pairs that preserve local texture but rearrange the global configuration of parts so that the two images depict different object categories.

An analysis of 86 convolutional, transformer, and hybrid models shows that CSS spans a wide range of compositional sensitivity, with fully self-supervised and language-aligned transformers such as DINOv2, SigLIP2, and EVA-CLIP occupying the top of the spectrum. Mechanistic investigation reveals that high-CSS networks rely on long-range interactions: radially controlled attention masks destroy performance and expose a distinctive U-shaped integration profile, and representational similarity analysis reveals a transition from local to global coding at intermediate depth. BagNet controls remain at chance level, ruling out "edge-hacking" strategies. Finally, CSS also predicts performance on other shape-dependent evaluations. The authors conclude that the path toward truly robust, generalizable, and human-like vision systems may lie in architectures and learning frameworks that seamlessly integrate local texture and global compositional shape, rather than forcing an artificial choice between shape and texture.
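The CSS evaluation described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the exact scoring rule and data format may differ. Here an "anagram pair" is two images that share local texture but rearrange parts to depict different categories, and a pair counts only if the model classifies both members correctly, which texture cues alone cannot achieve.

```python
# Hypothetical sketch of a CSS-style score (assumed scoring rule: a pair
# counts only when BOTH anagram images are classified correctly).

def css_score(predict, anagram_pairs):
    """Fraction of anagram pairs where both rearrangements are recognized.

    Each pair holds two (image, label) tuples that share local texture but
    differ in the global arrangement of parts, so only sensitivity to
    global composition can separate them.
    """
    correct = sum(
        1
        for (img_a, label_a), (img_b, label_b) in anagram_pairs
        if predict(img_a) == label_a and predict(img_b) == label_b
    )
    return correct / len(anagram_pairs)

# Toy demo with stand-in string "images": the prefix encodes shared texture,
# the suffix encodes the category implied by the global part arrangement.
pairs = [(("tex1_dog", "dog"), ("tex1_cat", "cat")),
         (("tex2_car", "car"), ("tex2_boat", "boat"))]

# A shape-sensitive model reads the global arrangement; a texture-only model
# maps both members of a pair to the same label, so it can never get a pair.
shape_model = lambda img: img.split("_")[1]
texture_model = lambda img: {"tex1": "dog", "tex2": "car"}[img.split("_")[0]]

print(css_score(shape_model, pairs))    # 1.0
print(css_score(texture_model, pairs))  # 0.0
```

The all-or-nothing pair criterion is what makes the score "absolute" rather than a relative shape-vs.-texture bias: a purely texture-driven model stays at chance no matter how strong its texture features are.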

Takeaways, Limitations

Takeaways:
  • Presents CSS, a new shape evaluation metric that considers local texture and global shape information simultaneously.
  • Demonstrates strong compositional shape recognition in self-supervised and language-aligned transformer models.
  • Elucidates the long-range interactions and local-to-global coding transitions underlying high-CSS models.
  • Points to a new direction for building robust, generalizable vision systems: integrating local texture and global shape information.
  • Verifies CSS's ability to predict other shape-dependent evaluation metrics.
Limitations:
  • Further research is needed on the versatility of CSS and its generalizability across diverse object categories.
  • Potential bias toward specific model architectures and training methods.
  • Lacks a direct comparison with human shape recognition mechanisms.
  • The BagNet control used to rule out edge-hacking strategies may not generalize beyond specific models.