Daily Arxiv

This page collects and summarizes artificial intelligence papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Created by
  • Haebom

Author

Wenfeng Feng, Hongxiang Wang, Jianlong Wang, Xin Zhang, Jingjing Zhao, Yueyue Liang, Xiang Chen, Duokui Han

Outline

This paper proposes EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. EDIT adopts an encoder-decoder design: the encoder applies self-attention to process image patches, while the decoder uses cross-attention focused on the [CLS] token. Unlike traditional encoder-decoder architectures, EDIT aligns the decoder with the encoder layer by layer, allowing the decoder to refine its representation progressively, starting from low-level features. EDIT provides interpretability through its sequential attention maps and consistently outperforms the DeiT3 model on ImageNet-1k, ImageNet-21k, and transfer learning tasks.
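A minimal PyTorch sketch of this layer-aligned idea is shown below. This is not the authors' implementation; the module names, pre-norm layout, and dimensions are assumptions. It illustrates the core structure described above: each decoder layer's [CLS] query cross-attends to the output of the matching encoder layer, so the class representation is refined progressively while the patch tokens never compete with the [CLS] token inside self-attention.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Standard ViT-style encoder layer: self-attention over patch tokens only."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class DecoderLayer(nn.Module):
    """Decoder layer: the [CLS] query cross-attends to one encoder layer's patch tokens."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cls, patches):
        q, kv = self.norm_q(cls), self.norm_kv(patches)
        cls = cls + self.cross_attn(q, kv, kv, need_weights=False)[0]
        return cls + self.mlp(self.norm2(cls))

class EDITSketch(nn.Module):
    """Layer-aligned encoder-decoder: decoder layer i reads encoder layer i's output,
    so the [CLS] representation is built up from low-level to high-level features."""
    def __init__(self, dim=384, depth=12, heads=6, num_classes=1000):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.ModuleList(EncoderLayer(dim, heads) for _ in range(depth))
        self.decoder = nn.ModuleList(DecoderLayer(dim, heads) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):              # patches: (B, N, dim), already patch-embedded
        cls = self.cls.expand(patches.size(0), -1, -1)
        for enc, dec in zip(self.encoder, self.decoder):
            patches = enc(patches)           # patch tokens never attend to [CLS]
            cls = dec(cls, patches)          # [CLS] queries this layer's patch features
        return self.head(cls.squeeze(1))

# Usage: logits = EDITSketch()(torch.randn(2, 196, 384))  # -> shape (2, 1000)
```

Per decoder layer, the cross-attention map over the patch tokens can be read out directly, which is what makes the sequential attention maps mentioned above straightforward to visualize.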

Takeaways, Limitations

A novel architecture is presented to address the attention sink problem.
Progressive feature extraction via layer-aligned encoder-decoder architecture.
Improving model interpretability through sequential attention maps.
Demonstrated performance improvement over the DeiT3 model on ImageNet and transfer learning tasks.
The paper does not explicitly specify its limitations.