While semi-supervised learning (SSL) leverages unlabeled data to address label scarcity, vision-language models (VLMs) pretrained on large-scale image-text pairs often generalize better and outperform SSL methods. This paper studies how to effectively transfer the strong generalization capabilities of VLMs to task-specific models. Knowledge distillation (KD) is a natural framework for this transfer, but it suffers from gradient conflicts between the supervised loss and the distillation loss. To address this, we propose Dual-Head Optimization (DHO), which introduces two prediction heads, each trained on a different supervision signal. DHO resolves the gradient conflict, enabling improved feature learning over single-head KD baselines. It also incurs minimal computational overhead and allows its head-combination hyperparameters to be tuned at test time without retraining. Extensive experiments on 15 datasets show that DHO consistently outperforms KD baselines, often surpassing the teacher model with smaller students. DHO also achieves new state-of-the-art generalization performance on in-distribution semi-supervised ImageNet and on out-of-distribution ImageNet variants.
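To make the dual-head idea concrete, the following is a minimal sketch of a dual-head student in PyTorch, not the authors' implementation. It assumes a frozen CLIP-like teacher that provides soft labels, a shared student backbone with two linear heads, and a convex combination coefficient `alpha` (plus a distillation temperature `tau`) for merging the two heads at inference; the class and function names, the way data is routed to each head, and the default values are all illustrative assumptions.

```python
# Hypothetical sketch of a dual-head student, not the official DHO code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadStudent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                          # shared feature extractor
        self.head_ce = nn.Linear(feat_dim, num_classes)   # head for the supervised signal
        self.head_kd = nn.Linear(feat_dim, num_classes)   # head for the distillation signal

    def forward(self, x):
        z = self.backbone(x)
        return self.head_ce(z), self.head_kd(z)


def train_step(student, x_labeled, y, x_unlabeled, teacher_probs, tau=2.0):
    """One training step: each head receives only its own loss, so the
    supervised and distillation gradients no longer collide in a single head.
    (How data is split between the heads is an assumption of this sketch.)"""
    logits_ce, _ = student(x_labeled)
    loss_ce = F.cross_entropy(logits_ce, y)

    _, logits_kd = student(x_unlabeled)
    loss_kd = F.kl_div(
        F.log_softmax(logits_kd / tau, dim=-1),
        teacher_probs,                      # soft labels from the frozen VLM teacher
        reduction="batchmean",
    ) * tau**2
    return loss_ce + loss_kd


@torch.no_grad()
def predict(student, x, alpha=0.5, tau=2.0):
    """Test-time inference: interpolate the two heads' probabilities.
    `alpha` (and `tau`) can be tuned on held-out data without retraining."""
    logits_ce, logits_kd = student(x)
    probs = alpha * F.softmax(logits_ce, dim=-1) \
        + (1.0 - alpha) * F.softmax(logits_kd / tau, dim=-1)
    return probs.argmax(dim=-1)
```

Because `alpha` and `tau` only enter at the interpolation step in `predict`, they can be swept on a validation set after training, which is the property the abstract refers to as test-time hyperparameter tuning without retraining.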