This paper addresses semi-supervised semantic segmentation (SSS) in domain-varying environments by leveraging domain-invariant semantic knowledge derived from the text embeddings of vision-language models (VLMs). We propose a unified hierarchical vision-language framework (HVL) that integrates domain-invariant text embeddings into the object queries of a transformer-based segmentation network, improving generalization and reducing misclassification under limited supervision. The proposed text queries serve as prompts that group pixels with shared semantics under SSS. HVL is designed to (1) generate text queries that capture within-class variation while maximizing the domain-invariant semantics provided by VLMs, and (2) align these queries with spatial visual features to improve segmentation performance and sharpen the semantic clarity of the visual features. Furthermore, we introduce a targeted regularization loss that maintains vision-language alignment throughout training to strengthen semantic understanding. HVL establishes a new state of the art with less than 1% of labeled data on four benchmark datasets: COCO (+9.3% mIoU with 232 labeled images), Pascal VOC (+3.1% with 92 labels), ADE20K (+4.8% with 316 labels), and Cityscapes (+3.4% with 100 labels). These results demonstrate that language-guided segmentation narrows the label-efficiency gap and enables fine-grained generalization.
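To make the core mechanism concrete, the sketch below illustrates one plausible reading of the abstract: frozen VLM text embeddings are projected into object queries and fused with spatial visual features through cross-attention, after which query-pixel similarities yield per-class segmentation logits. This is a minimal illustration under our own assumptions, not the authors' implementation; module names, dimensions, and the `TextGuidedQueryDecoderLayer` class are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): VLM text embeddings as object
# queries in a transformer-based segmentation decoder.
import torch
import torch.nn as nn


class TextGuidedQueryDecoderLayer(nn.Module):
    """One decoder layer: text-derived queries attend to flattened pixel features."""

    def __init__(self, d_model: int = 256, text_dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Project frozen VLM text embeddings (e.g., CLIP-like, dim 512) into query space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_emb: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
        """
        text_emb:    (B, K, text_dim)  one VLM text embedding per class/prompt
        pixel_feats: (B, HW, d_model)  flattened spatial features from the encoder
        returns:     (B, K, d_model)   text-conditioned object queries
        """
        q = self.text_proj(text_emb)                        # text queries
        q = self.norm1(q + self.self_attn(q, q, q)[0])      # query self-attention
        q = self.norm2(q + self.cross_attn(q, pixel_feats, pixel_feats)[0])
        q = self.norm3(q + self.ffn(q))
        return q


def segmentation_logits(queries: torch.Tensor, pixel_feats: torch.Tensor,
                        h: int, w: int) -> torch.Tensor:
    """Per-pixel class logits as query-pixel similarity (grouping pixels by shared semantics)."""
    logits = torch.einsum("bkc,bnc->bkn", queries, pixel_feats)  # (B, K, HW)
    return logits.view(logits.size(0), logits.size(1), h, w)


if __name__ == "__main__":
    B, K, D, TD, H, W = 2, 21, 256, 512, 32, 32
    layer = TextGuidedQueryDecoderLayer(d_model=D, text_dim=TD)
    text_emb = torch.randn(B, K, TD)        # stands in for frozen VLM text embeddings
    pixel_feats = torch.randn(B, H * W, D)  # stands in for encoder features
    queries = layer(text_emb, pixel_feats)
    print(segmentation_logits(queries, pixel_feats, H, W).shape)  # (2, 21, 32, 32)
```

In this reading, the abstract's regularization loss would additionally constrain the refined queries to stay aligned with the original VLM text embeddings during training; the details of that loss are not specified here.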