Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

Created by
  • Haebom

Authors

Christoph Timmermann, Hyunse Lee, Woojin Lee

Outline

CLIP, which aligns image and text embeddings through contrastive learning, offers strong zero-shot performance, but it deteriorates in few-shot classification, where images must be compared within a single modality. This stems from the modality gap and CLIP's cross-modal training objective: the embedding space is not calibrated for intra-modal comparison, making direct image-to-image comparisons unreliable. SeMoBridge addresses this by mapping images into the text modality with a lightweight, robust Semantic Modality Bridge that preserves the images' semantic content. The bridge has a closed-form solution and can optionally be trained with multi-modal supervision through a combined image and text alignment loss, yielding SeMoBridge-T. SeMoBridge-T outperforms other methods while requiring far less training time, particularly in low-data scenarios.
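To make the core idea concrete, here is a minimal sketch, not the paper's actual formulation: it fits a closed-form ridge-regression map from few-shot image embeddings to their class text embeddings, then classifies test images by cosine similarity against class prompts in the text space. Every name here (`fit_bridge`, `classify`, the regularizer `lam`) is an illustrative assumption; SeMoBridge's actual closed-form solution and the SeMoBridge-T training objective are defined in the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length, as is standard for CLIP embeddings."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fit_bridge(img_emb, txt_emb, lam=1.0):
    """Closed-form ridge-regression map W such that img_emb @ W ~ txt_emb.

    img_emb: (n, d) support-image embeddings (one row per few-shot image)
    txt_emb: (n, d) text embedding of each image's class prompt
    lam:     ridge regularizer keeping W well-conditioned with few shots
    """
    d = img_emb.shape[1]
    a = img_emb.T @ img_emb + lam * np.eye(d)  # (d, d) regularized Gram matrix
    b = img_emb.T @ txt_emb                    # (d, d) cross-covariance
    return np.linalg.solve(a, b)               # W: (d, d)

def classify(test_img_emb, class_txt_emb, W):
    """Bridge test images into text space, then score against class prompts."""
    bridged = l2_normalize(test_img_emb @ W)
    prototypes = l2_normalize(class_txt_emb)
    return (bridged @ prototypes.T).argmax(axis=1)  # cosine-similarity argmax

# Toy usage with random stand-ins for CLIP embeddings (d = 512).
rng = np.random.default_rng(0)
support = l2_normalize(rng.normal(size=(16, 512)))  # e.g. 4 classes x 4 shots
prompts = l2_normalize(rng.normal(size=(16, 512)))  # matching class prompts
W = fit_bridge(support, prompts)
queries = l2_normalize(rng.normal(size=(8, 512)))   # unseen test images
classes = l2_normalize(rng.normal(size=(4, 512)))   # one prompt per class
print(classify(queries, classes, W))                # predicted class indices
```

The point this sketch illustrates is efficiency: because the bridge is closed-form, adapting to a new few-shot task amounts to solving one linear system instead of running gradient training; the optional SeMoBridge-T stage then refines the bridge with multi-modal supervision.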

Takeaways, Limitations

Takeaways:
  • A novel approach (SeMoBridge) that improves CLIP's few-shot classification performance.
  • An efficient way to address the modality gap: mapping images into the text modality.
  • Outperforms other methods in low-data settings.
  • Practical thanks to its simplicity and short training time.
Limitations:
  • The paper itself does not explicitly state any limitations.