High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting enable the interrogation of cells' morphological responses to perturbations at an unprecedented scale. These data are expected to deepen our understanding of the relationships between different perturbations and their effects on cellular state. Toward this goal, recent advances in cross-modal contrastive learning could be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, applying this approach to HCS data is challenging due to the semantic differences between Cell Painting images and natural images, and the difficulty of representing different types of perturbations, such as small molecules versus CRISPR gene knockouts, in a single latent space. To address these challenges, this paper introduces CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP combines a pre-trained image encoder with a novel channel encoding scheme to better capture the relationships among different microscopy channels in image embeddings, along with a natural language encoder to represent perturbations. The proposed framework outperforms existing open-source models, achieving state-of-the-art performance in both cross-modal retrieval and biologically meaningful downstream tasks while significantly reducing computation time.
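
The central idea of the abstract, aligning Cell Painting image embeddings with natural-language perturbation embeddings in a shared latent space, follows the general recipe of CLIP-style cross-modal contrastive learning. The sketch below illustrates only that general recipe under assumed names and hyperparameters (the loss function name, temperature, and embedding dimensions are placeholders); it does not reproduce CellCLIP's actual architecture or channel encoding scheme.

```python
# Minimal sketch (not the authors' implementation) of a symmetric CLIP-style
# contrastive objective pairing image embeddings with perturbation embeddings.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (image, perturbation) embeddings."""
    # Normalize embeddings so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares image i with perturbation j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching (image, perturbation) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Example usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 32, 512
    img = torch.randn(batch, dim)   # e.g., pooled multi-channel Cell Painting features
    txt = torch.randn(batch, dim)   # e.g., encoded natural-language perturbation descriptions
    print(clip_contrastive_loss(img, txt).item())
```

In this formulation, each image in a batch is pushed toward the embedding of its own perturbation and away from the other perturbations in the batch, and vice versa, which is what yields a latent space usable for cross-modal retrieval.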