Defending against backdoor attacks in federated learning (FL) with heterogeneous client data distributions poses the challenge of balancing defense effectiveness and privacy. Existing methods rely heavily on uniform client data distributions or on the availability of a clean server dataset. In this paper, we propose CLIP-Fed, an FL backdoor defense framework that leverages the zero-shot learning capabilities of vision-language pre-trained models. By integrating pre-aggregation and post-aggregation defense strategies, CLIP-Fed overcomes the limitation that Non-IID data imposes on defense effectiveness. Using a prototype contrastive loss and the Kullback-Leibler divergence, CLIP-Fed aligns the global model with CLIP knowledge on an augmented dataset, mitigating the class prototype bias caused by backdoor samples and severing the correlation between trigger patterns and target labels. Furthermore, to balance privacy with coverage of diverse trigger patterns, we construct and augment a server dataset without any client samples, using a multimodal large language model and frequency analysis. CLIP-Fed reduces the average attack success rate (ASR) by 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving the average main task accuracy (MTA) by 7.92% and 0.48%, respectively.
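To make the alignment objective concrete, the following is a minimal PyTorch-style sketch of how a prototype contrastive term and a KL term toward CLIP's zero-shot predictions could be combined. The function name, tensor shapes, temperature `tau`, and weighting `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(global_logits, clip_logits, features, prototypes,
                   labels, tau=0.07, lam=1.0):
    """Hypothetical sketch of a CLIP-aligned objective: a KL term that
    pulls the global model's predictions toward CLIP's zero-shot
    predictions, plus a prototype contrastive term that pulls each
    feature toward its class prototype. All names are assumptions."""
    # KL divergence between the global model's output distribution and
    # CLIP's zero-shot distribution over the same augmented samples.
    kl = F.kl_div(
        F.log_softmax(global_logits, dim=1),
        F.softmax(clip_logits, dim=1),
        reduction="batchmean",
    )

    # Prototype contrastive term: the prototype of each sample's class
    # is the positive; all other class prototypes act as negatives.
    feats = F.normalize(features, dim=1)      # (B, D) sample features
    protos = F.normalize(prototypes, dim=1)   # (C, D) class prototypes
    sim = feats @ protos.t() / tau            # (B, C) scaled similarities
    proto_con = F.cross_entropy(sim, labels)

    return kl + lam * proto_con
```

Under this reading, the KL term removes the trigger-to-target-label correlation by anchoring predictions to CLIP's trigger-agnostic knowledge, while the contrastive term counteracts prototype bias introduced by backdoor samples.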