Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Octic Vision Transformers: Quicker ViTs Through Equivariance

Created by
  • Haebom

Author

David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

Outline

This paper observes that state-of-the-art Vision Transformers (ViTs) are not designed to exploit natural geometric symmetries of images, such as 90-degree rotations and reflections, and argues that the lack of efficient implementations is the cause. To address this, it introduces Octic Vision Transformers (octic ViTs), which capture these symmetries through equivariance to the octic group, the dihedral group of 90-degree rotations and reflections. In contrast to the computational overhead of previous equivariant models, octic linear layers achieve a 5.33x reduction in FLOPs and up to an 8x reduction in memory compared to ordinary linear layers. The authors study two new families of ViTs built from octic blocks and train them on ImageNet-1K with both supervised (DeiT-III) and self-supervised (DINOv2) learning, matching baseline accuracy while delivering significant efficiency gains.
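The core idea, equivariance to the octic group, can be illustrated with a minimal NumPy sketch. This is not the paper's architecture (its octic linear layers act on ViT token features); the sketch below only demonstrates the defining property f(g·x) = g·f(x) using a toy layer whose stencil is symmetric under all eight group elements. The names `OCTIC_GROUP` and `equivariant_layer` are my own, chosen for illustration.

```python
import numpy as np

# The octic group is the dihedral group of order 8: the symmetries of a
# square (rotations by 0/90/180/270 degrees, with and without a reflection).
OCTIC_GROUP = [
    (lambda a, k=k, flip=flip:
        np.fliplr(np.rot90(a, k)) if flip else np.rot90(a, k))
    for k in range(4) for flip in (False, True)
]

def equivariant_layer(x):
    """Toy octic-equivariant map: 4-neighbour average with periodic
    boundaries. Because the stencil is symmetric under every octic-group
    operation, the layer commutes with all eight group elements."""
    return 0.25 * (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                   + np.roll(x, 1, 1) + np.roll(x, -1, 1))

x = np.random.default_rng(0).standard_normal((8, 8))

# Equivariance check: f(g . x) == g . f(x) for every group element g.
equivariant = all(
    np.allclose(equivariant_layer(g(x)), g(equivariant_layer(x)))
    for g in OCTIC_GROUP
)
print(equivariant)
```

An equivariant network built from such layers processes a rotated or reflected image exactly as it processes the original, which is the symmetry the paper exploits to cut FLOPs and memory.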

Takeaways, Limitations

Takeaways:
Octic ViTs significantly improve computational and memory efficiency while maintaining the same accuracy as conventional ViTs.
They effectively capture geometric symmetries through octic-group equivariance.
They offer flexibility by providing two new ViT families (fully octic-equivariant and partially equivariant).
The efficiency gains hold under both supervised and self-supervised training.
Limitations:
The specific structure and implementation details of the octic ViTs are not fully specified.
Performance has not been validated on datasets other than ImageNet-1K.
There is little detailed explanation of the design of the final part of the network, which breaks equivariance.