Accurate and generalizable object segmentation in ultrasound images remains challenging due to anatomical variations, diverse imaging protocols, and limited annotated data. To address this, we propose a prompt-based Vision-Language Model (VLM) that integrates Grounding DINO with SAM2. We use 18 publicly available ultrasound datasets covering the breast, thyroid, liver, prostate, kidney, and paraspinal muscles. Fifteen datasets are used to fine-tune and validate Grounding DINO with Low-Rank Adaptation (LoRA), while the remaining three are held out for testing to evaluate performance on unseen distributions. Experimental results demonstrate that the proposed method outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS, on most of the seen datasets, and maintains robust performance on the unseen datasets without additional fine-tuning. These findings indicate that the VLM-based approach reduces the reliance on large-scale, organ-specific annotated data and holds promise for scalable and robust ultrasound image analysis.
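To make the described pipeline concrete, the sketch below shows one plausible way to chain a LoRA-adapted Grounding DINO detector with SAM2: a free-text prompt yields bounding boxes, which are then used as box prompts for SAM2. It is a minimal sketch, not the paper's implementation; the checkpoint names, text prompt, LoRA hyperparameters, and target module names are assumptions and would need to match the actual model and training setup.

```python
# Minimal sketch of a prompt-based Grounding DINO -> SAM2 segmentation pipeline.
# Checkpoints, the text prompt, and LoRA settings are illustrative assumptions,
# not the configuration reported in the paper.
import numpy as np
import torch
from PIL import Image
from peft import LoraConfig, get_peft_model
from sam2.sam2_image_predictor import SAM2ImagePredictor
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Grounding DINO: an open-set detector prompted with free text.
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base"
).to(device)

# 2) Wrap the detector with LoRA adapters so only low-rank updates are trained
#    when fine-tuning on the ultrasound datasets (module names are assumed and
#    must match the attention projections of the loaded model).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["value_proj", "output_proj"],  # assumption
    lora_dropout=0.1,
)
detector = get_peft_model(detector, lora_config)
# ... fine-tune the LoRA adapters on the 15 training datasets here ...

# 3) SAM2 image predictor, prompted with the detector's bounding boxes.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = Image.open("ultrasound_example.png").convert("RGB")  # hypothetical input
text_prompt = "breast tumor."  # Grounding DINO expects lowercase text ending with "."

# Detect boxes for the prompted structure.
inputs = processor(images=image, text=text_prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)
detections = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    target_sizes=[image.size[::-1]],
)[0]

# Use each detected box as a prompt for SAM2 to obtain a segmentation mask.
predictor.set_image(np.array(image))
for box, score in zip(detections["boxes"], detections["scores"]):
    masks, mask_scores, _ = predictor.predict(
        box=box.cpu().numpy(),
        multimask_output=False,
    )
    print(f"box score={score:.3f}, predicted mask quality={mask_scores[0]:.3f}")
```

Because the detector is conditioned only on text and the segmenter only on boxes, swapping in a new organ or dataset in this setup amounts to changing the prompt rather than retraining an organ-specific segmentation head.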