Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Uncovering Grounding IDs: How External Cues Shape Multi-Modal Binding

Created by
  • Haebom

Author

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Outline

Large vision-language models (LVLMs) achieve strong results on multimodal benchmarks but remain limited in structured reasoning and precise grounding. This study investigates why adding simple visual structure (e.g., partitioning the image into segments and labeling them) improves accuracy, and proposes the concept of "grounding IDs": latent identifiers induced by external cues that bind entities to their designated segments across modalities. Representation analysis shows that these identifiers manifest as robust intra-segment alignment in embedding space, bridging the modality gap between image and text. Causal interventions confirm that the identifiers mediate binding between objects and their symbolic cues. By strengthening attention between related components, grounding IDs improve cross-modal grounding and reduce hallucinations. These findings reveal grounding IDs as a key symbolic mechanism by which external cues enhance multimodal binding, offering both interpretability and practical robustness gains.
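The "intra-segment alignment" finding can be pictured as follows: embeddings of tokens and patches that belong to the same labeled segment are more similar to each other than to those of other segments. Below is a minimal toy sketch of how such an analysis could be probed. The function names, data, and thresholds are illustrative assumptions, not the paper's actual methodology or code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def intra_vs_inter_alignment(embeddings, segment_ids):
    """Compare mean cosine similarity of embedding pairs that share a
    segment (intra) against pairs from different segments (inter).
    A grounding-ID-like effect would show intra >> inter."""
    intra, inter = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(embeddings[i], embeddings[j])
            (intra if segment_ids[i] == segment_ids[j] else inter).append(sim)
    return np.mean(intra), np.mean(inter)

# Toy demo: two segments, embeddings clustered around distinct directions
# (stand-ins for image-patch and text-token features of the same segment).
rng = np.random.default_rng(0)
centers = {0: np.array([1.0, 0.0, 0.0]), 1: np.array([0.0, 1.0, 0.0])}
segment_ids = [0, 0, 1, 1]
embeddings = [centers[s] + 0.05 * rng.standard_normal(3) for s in segment_ids]
intra_mean, inter_mean = intra_vs_inter_alignment(embeddings, segment_ids)
```

In this toy setup, `intra_mean` comes out close to 1 while `inter_mean` stays near 0, mirroring the qualitative pattern the paper reports for cue-induced representations.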

Takeaways, Limitations

Takeaways:
Grounding IDs provide a core mechanism explaining how external cues enhance multimodal binding.
Grounding IDs reduce the modality gap by inducing intra-segment alignment in the embedding space.
Grounding IDs strengthen attention between related components, improving cross-modal grounding and reducing hallucinations.
The findings offer improved interpretability and practical robustness.
Limitations:
No specific limitations are explicitly discussed in the paper.