Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions

Created by
  • Haebom

Author

Lucas Möller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó

Outline

This paper analyzes how dual-encoder architectures such as CLIP map two types of inputs into a shared embedding space and predict their similarity. To overcome the limitations of existing first-order feature attribution methods, which can only explain individual features, the authors propose a second-order method that attributes a dual encoder's predictions to interactions between features. Applying this method to CLIP models, they show that the models learn fine-grained correspondences between caption segments and image regions, accounting for both matches and mismatches of objects. However, they also reveal that this visual-linguistic grounding ability varies considerably across object classes and exhibits substantial out-of-domain effects, and that the method can surface both individual errors and systematic failure patterns. The code is publicly available.
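To illustrate the core idea of second-order (interaction) attributions, here is a minimal toy sketch. It is not the authors' exact method: it assumes a simplified bilinear similarity s = xᵀWy between a caption embedding x and an image embedding y, in which case each caption-feature/image-feature pair (i, j) receives the attribution x_i·W_ij·y_j, and these attributions sum exactly to the prediction (a completeness property). All variable names and the bilinear form are illustrative assumptions.

```python
import numpy as np

# Toy second-order attribution for a bilinear dual-encoder similarity.
# NOT the paper's exact method: we assume s = x^T W y, so the interaction
# attribution of caption feature i and image feature j is x_i * W[i,j] * y_j.
rng = np.random.default_rng(0)
d_text, d_img = 4, 5
x = rng.normal(size=d_text)           # caption-side features (toy)
y = rng.normal(size=d_img)            # image-side features (toy)
W = rng.normal(size=(d_text, d_img))  # learned interaction weights (toy)

s = x @ W @ y                    # dual-encoder similarity score
A = x[:, None] * W * y[None, :]  # per-pair interaction attribution matrix

# Completeness: the interaction attributions sum to the prediction.
print(np.isclose(A.sum(), s))  # True
```

In practice, the paper applies this kind of interaction analysis to CLIP's full encoders rather than a single bilinear layer, so the attributions there are computed with respect to caption tokens and image patches rather than raw embedding coordinates.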

Takeaways, Limitations

Takeaways:
We present a novel second-order method that attributes the predictions of dual-encoder models to interactions between features.
We demonstrate that CLIP learns fine-grained correspondences between caption segments and image regions, accounting for both object matches and mismatches.
We characterize the strengths and limitations of CLIP's visual-linguistic grounding ability (variation across object classes, out-of-domain effects, individual errors, and systematic failure modes).
The publicly released code supports reproducibility and further research.
Limitations:
CLIP's visual-linguistic grounding ability varies significantly across object classes and degrades out of domain.
The models exhibit both individual errors and systematic failure modes.