Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment

Created by
  • Haebom

Author

Dahun Kim, Anelia Angelova

Outline

This paper proposes Context-Adaptive Multi-Prompt Embedding, a method for enriching semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, the method introduces multiple structured prompts, each containing distinct adaptive tokens that capture different semantic aspects of the input text. Within the CLIP framework, a pre-trained LLM serves as the text encoder and processes all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling richer semantic alignment with visual features. To further enhance semantic diversity and representational quality, the authors add a diversity regularization loss and a negation recognition loss that encourage specialization among prompts and improve contrastive discrimination. The method achieves consistent gains on image-to-text and video-to-text retrieval benchmarks.
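To make the mechanism concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the class name MultiPromptTextHead, the number of prompts and tokens, mean pooling, and the Hugging Face-style llm_encoder(inputs_embeds=...).last_hidden_state interface are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPromptTextHead(nn.Module):
    """Sketch of a multi-prompt text head: each of K prompts carries its own
    learnable adaptive tokens; their pooled embeddings are fused into one
    text representation for contrastive alignment with image features."""

    def __init__(self, llm_encoder, hidden_dim, num_prompts=4,
                 tokens_per_prompt=4, embed_dim=512):
        super().__init__()
        self.llm_encoder = llm_encoder  # pre-trained LLM text encoder (assumed frozen, HF-style API)
        # K sets of learnable adaptive prompt tokens
        self.adaptive_tokens = nn.Parameter(
            torch.randn(num_prompts, tokens_per_prompt, hidden_dim) * 0.02)
        self.proj = nn.Linear(hidden_dim, embed_dim)  # projection into the shared embedding space

    def forward(self, token_embeds):
        # token_embeds: (B, L, hidden_dim) embedded caption tokens
        B = token_embeds.size(0)
        K, T, H = self.adaptive_tokens.shape
        prompts = self.adaptive_tokens.unsqueeze(0).expand(B, -1, -1, -1)  # (B, K, T, H)
        # Append all K prompts' tokens so every prompt is processed in a single pass.
        seq = torch.cat([token_embeds, prompts.reshape(B, K * T, H)], dim=1)
        hidden = self.llm_encoder(inputs_embeds=seq).last_hidden_state     # (B, L + K*T, H)
        # Pool each prompt's token states separately -> one embedding per prompt.
        prompt_states = hidden[:, -K * T:].reshape(B, K, T, H).mean(dim=2)  # (B, K, H)
        prompt_embeds = F.normalize(self.proj(prompt_states), dim=-1)       # (B, K, D)
        # Combine the per-prompt embeddings into a unified text representation.
        text_embed = F.normalize(prompt_embeds.mean(dim=1), dim=-1)         # (B, D)
        return text_embed, prompt_embeds
```

The unified text_embed would then enter a standard CLIP-style contrastive loss against image (or video) features, while the per-prompt embeddings are available for the auxiliary losses described below.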

Takeaways, Limitations

Takeaways:
We demonstrate that leveraging multiple prompts can enhance the richness of semantic representations in vision-language contrastive learning.
We present a method to effectively utilize pre-trained LLMs to capture various semantic aspects.
It achieves performance improvements through a diversity regularization loss and a negation recognition loss (a sketch of the diversity term follows this list).
We experimentally demonstrate performance improvements in image-to-text and video-to-text retrieval tasks.
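The exact form of the paper's diversity regularization is not given in this summary; the following is a minimal sketch of one common way to encourage specialization, penalizing pairwise similarity among the per-prompt embeddings produced above. The function name and the squared-cosine penalty are assumptions, not the authors' definition.

```python
import torch

def diversity_regularization(prompt_embeds):
    """Hypothetical diversity term: penalize pairwise cosine similarity
    between per-prompt embeddings so each prompt specializes on a
    different semantic aspect of the caption."""
    # prompt_embeds: (B, K, D), assumed L2-normalized per prompt
    sim = torch.einsum('bkd,bqd->bkq', prompt_embeds, prompt_embeds)  # (B, K, K) cosine similarities
    K = prompt_embeds.size(1)
    off_diag = sim * (1.0 - torch.eye(K, device=sim.device))          # zero out self-similarity
    return off_diag.clamp(min=0).pow(2).mean()                        # redundant prompts are penalized
```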
Limitations:
The proposed method may be more computationally expensive than existing single-prompt methods, since multiple prompts must be processed.
There may be some dependencies on specific LLMs.
Further research may be needed to determine the optimal hyperparameter settings for diversity regularization loss and negation recognition loss.
Due to limitations of the benchmark used, further validation of generalization performance may be required.