Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Created by
  • Haebom

Authors

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Outline

This paper analyzes the attention patterns of Vision-Language Models (VLMs) to address their performance degradation in complex visual environments. The analysis reveals a strong correlation between attention entropy and visual complexity: as scenes become more complex, attention entropy rises and inference performance degrades. It further shows that attention is progressively refined from global scanning in shallow layers to focused convergence in deep layers, with the degree of convergence determined by visual complexity. Based on these insights, the authors propose CARVE (Contrastive Attention Refinement for Visual Enhancement), a training-free method that extracts task-relevant visual signals through pixel-level attention contrast. Experiments show that CARVE improves performance by up to 75% on open-source models.
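The core idea of pixel-level attention contrast can be sketched in a few lines: compare an attention map conditioned on the task query against one from a generic query, and keep only the pixels where the task query attends more strongly. The sketch below is illustrative only (not the authors' implementation); `attention_entropy` and `contrastive_refine` are hypothetical names, and the toy maps stand in for real VLM attention.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention map (higher = more diffuse attention)."""
    p = attn.flatten()
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def contrastive_refine(task_attn: np.ndarray, generic_attn: np.ndarray) -> np.ndarray:
    """Contrast a task-conditioned attention map against a generic one,
    keeping only pixels where the task query attends more strongly
    (a rough analogue of separating semantic signal from visual noise)."""
    contrast = np.clip(task_attn - generic_attn, 0.0, None)
    total = contrast.sum()
    return contrast / total if total > 0 else contrast

# Toy example: the generic query is diffuse; the task query boosts one region.
rng = np.random.default_rng(0)
generic = rng.random((8, 8))
task = generic.copy()
task[2:4, 2:4] += 5.0  # hypothetical task-relevant region

refined = contrastive_refine(task / task.sum(), generic / generic.sum())
print(attention_entropy(generic / generic.sum()) > attention_entropy(refined))  # True
```

After the contrast, attention mass concentrates on the task-relevant region, so the refined map has lower entropy than the diffuse generic map, mirroring the paper's link between low attention entropy and focused reasoning.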

Takeaways, Limitations

Takeaways:
We investigated the relationship between visual complexity and inference performance by analyzing the attention mechanism of VLMs.
We present CARVE, an efficient method to improve the performance of VLMs without training.
We present a novel approach that decomposes visual signals into semantic signals and visual noise by leveraging attention contrast.
It showed significant performance improvements on open-source models.
Limitations:
Further research is needed to determine whether CARVE's performance improvements are consistent across all VLMs and all types of visual complexity.
The proposed method may be biased towards certain types of VLMs or certain tasks.
Pixel-wise attention contrast can be computationally expensive.