Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

Created by
  • Haebom

Authors

Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais

Outline

This paper addresses semi-supervised semantic segmentation (SSS) in domain-varying environments by leveraging the domain-invariant semantic knowledge carried in the text embeddings of vision-language models (VLMs). We propose HVL, a unified hierarchical vision-language framework that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network, improving generalization and reducing misclassification under limited supervision. The text queries act as grouping anchors that cluster pixels with shared semantics under SSS.

HVL is designed to (1) generate text queries that capture within-class variation while maximizing the domain-invariant semantics extracted from VLMs, and (2) align these queries with spatial visual features, which both strengthens segmentation and sharpens the semantic clarity of the visual features. In addition, we introduce a targeted regularization loss that maintains vision-language alignment throughout training.

With less than 1% of the labels, HVL establishes a new state of the art on four benchmarks: COCO (+9.3% mIoU with 232 labeled images), Pascal VOC (+3.1% with 92 labels), ADE20K (+4.8% with 316 labels), and Cityscapes (+3.4% with 100 labels). The results demonstrate that language-guided segmentation bridges the label-efficiency gap and enables a new level of fine-grained generalization.
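To make the core mechanism concrete, here is a minimal, hypothetical PyTorch sketch of the idea described above: VLM text embeddings acting as object queries that cross-attend to spatial features, plus an alignment regularizer. This is not the authors' implementation; the module name TextQueryDecoder, the dimensions, and the cosine form of the alignment loss are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextQueryDecoder(nn.Module):
    """Hypothetical sketch: class-level text embeddings act as object
    queries that cross-attend to flattened spatial visual features."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_queries, visual_feats):
        # text_queries: (B, num_classes, dim) -- projected VLM text embeddings
        # visual_feats: (B, H*W, dim)         -- flattened spatial features
        attn_out, _ = self.cross_attn(text_queries, visual_feats, visual_feats)
        queries = self.norm(text_queries + attn_out)
        # Per-pixel class logits from query/feature similarity
        logits = torch.einsum('bqd,bpd->bqp', queries, visual_feats)
        return queries, logits

def alignment_loss(queries, text_embeds):
    # Assumed regularizer: keep the refined queries close (in cosine
    # similarity) to the frozen, domain-invariant text embeddings.
    q = F.normalize(queries, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return (1.0 - (q * t).sum(dim=-1)).mean()

# Usage with random stand-ins for VLM embeddings and backbone features:
B, C, HW, D = 2, 21, 1024, 256
text = torch.randn(B, C, D)     # stand-in for CLIP-style text embeddings
feats = torch.randn(B, HW, D)   # stand-in for flattened visual features
decoder = TextQueryDecoder(dim=D)
queries, logits = decoder(text, feats)
reg = alignment_loss(queries, text)
```

Keeping the text embeddings frozen while pulling the refined queries back toward them is one plausible way to preserve the domain-invariant semantics the paper emphasizes; the actual loss used in HVL may differ.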

Takeaways, Limitations

Takeaways:
We achieve significant performance gains in semi-supervised semantic segmentation with limited labeled data.
We present a novel framework that effectively leverages the domain-invariant semantic knowledge of vision-language models.
We bridge the label-efficiency gap and enable fine-grained generalization through language-guided segmentation.
We achieve state-of-the-art performance across four benchmark datasets.
Limitations:
The method is highly dependent on the underlying VLM, so segmentation quality is bounded by the VLM's performance.
The proposed method may be computationally expensive.
Additional evaluation of generalization across a wider range of domain shifts is needed.
There is a possibility of overfitting to certain domains.