Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Posted by
Haebom
Author
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
Outline
Talk2DINO is a novel hybrid approach to Open-Vocabulary Segmentation (OVS) that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. To address the weak spatial localization of existing vision-language models and the lack of language grounding in self-supervised vision backbones, the method aligns CLIP's text embeddings with DINOv2's patch-level features through a learned mapping function. DINOv2's attention maps are used to selectively align local visual patches with the text embeddings, without fine-tuning the underlying backbones. The authors show that Talk2DINO produces natural, low-noise segmentations, effectively separates foreground objects from the background, and achieves state-of-the-art performance on several unsupervised OVS benchmarks. The source code and models are publicly available.
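The following is a minimal PyTorch sketch of the alignment idea summarized above: a small learned mapper projects frozen CLIP text embeddings into the frozen DINOv2 patch-feature space, and the alignment objective is weighted by DINOv2's attention over patches. The module names, dimensions, and the cosine objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Talk2DINO-style alignment (assumptions, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchMapper(nn.Module):
    """Learned mapping from CLIP's text embedding space to DINOv2's
    patch-feature space; both backbones stay frozen."""
    def __init__(self, clip_dim: int = 512, dino_dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dino_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(text_emb)

def alignment_loss(text_emb, patch_feats, attn_map, mapper):
    """Align the mapped text embedding with the attention-weighted
    average of DINOv2 patch features, so only the patches the
    backbone attends to drive the alignment.

    text_emb:    (B, clip_dim)    frozen CLIP text embeddings
    patch_feats: (B, N, dino_dim) frozen DINOv2 patch features
    attn_map:    (B, N)           DINOv2 [CLS] self-attention over patches
    """
    mapped = F.normalize(mapper(text_emb), dim=-1)             # (B, dino_dim)
    weights = attn_map / attn_map.sum(dim=-1, keepdim=True)    # (B, N)
    pooled = torch.einsum("bn,bnd->bd", weights, patch_feats)  # (B, dino_dim)
    pooled = F.normalize(pooled, dim=-1)
    # Cosine-similarity loss as a stand-in; the paper's exact objective may differ.
    return (1.0 - (mapped * pooled).sum(dim=-1)).mean()
```

Only the small mapper is trained, which is what allows the method to avoid fine-tuning either backbone.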
Takeaways and Limitations
•
Takeaways:
◦
Combines the strengths of DINOv2 and CLIP to overcome the limitations of existing OVS methods.
◦
Achieves efficient training and improved performance through selective alignment guided by DINOv2's attention maps.
◦
Achieves strong performance without fine-tuning the backbone.
◦
Produces natural, low-noise segmentation results (see the inference sketch after this list).
◦
Effectively distinguishes foreground objects from the background.
◦
Achieves state-of-the-art performance, with source code and models released publicly.
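To illustrate how the trained mapping turns text prompts into per-patch classifiers at inference time, here is a hypothetical sketch under the same assumptions as the training snippet above; the paper's actual pipeline may differ (e.g., in upsampling or refinement of the patch-level mask).

```python
# Hypothetical inference sketch: each class prompt becomes a prototype
# in DINOv2's patch space, and patches are labeled by nearest prototype.
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment(patch_feats: torch.Tensor,     # (N, dino_dim), one image's DINOv2 patches
            class_text_embs: torch.Tensor, # (C, clip_dim), CLIP text embeddings
            mapper,                        # trained TextToPatchMapper
            grid_hw: tuple):               # (H, W) patch-grid shape, H * W == N
    """Assign each DINOv2 patch to the most similar class prompt."""
    protos = F.normalize(mapper(class_text_embs), dim=-1)  # (C, dino_dim)
    feats = F.normalize(patch_feats, dim=-1)               # (N, dino_dim)
    sims = feats @ protos.T                                # (N, C) cosine similarities
    labels = sims.argmax(dim=-1)                           # (N,) per-patch class index
    return labels.reshape(grid_hw)                         # (H, W) patch-level mask
```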
•
Limitations:
◦
The paper does not explicitly discuss its own limitations. Potential weaknesses, such as vulnerability to particular types of images or text, computational cost, and scalability, remain to be explored through further experiments or analysis.