This paper presents the first electroencephalogram-to-language model (ELM), trained on clinical reports paired with 15,000 electroencephalogram (EEG) recordings. Since prior multimodal language modeling research has not addressed clinical phenotyping of functional brain data, we align the two modalities by trimming the EEG time series and segmenting the report text, and we propose a multi-instance learning-based augmentation to mitigate the misalignment caused by irrelevant EEG or text segments. Experimental results show that the proposed multimodal model significantly outperforms EEG-only models across four clinical evaluations and, for the first time, enables zero-shot classification as well as retrieval of both neural signals and reports. These results mark a meaningful step toward the clinical applicability of ELMs.
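The abstract names multi-instance alignment between trimmed EEG windows and report sentences but does not specify the objective. Below is a minimal sketch of one way such an objective can be formed, assuming max pooling as the multi-instance aggregation and a symmetric InfoNCE contrastive loss over recording-report pairs; the function names, the aggregation operator, and the temperature are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def logsumexp(x, axis=None):
    """Numerically stable log-sum-exp, keeping dims for broadcasting."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def pair_similarity(eeg_segs, text_segs):
    """Multi-instance similarity between one EEG recording and one report.

    eeg_segs:  (n_eeg, d) embeddings of trimmed EEG windows
    text_segs: (n_txt, d) embeddings of report sentences
    Max pooling over all segment pairs lets irrelevant segments contribute
    nothing to the score (an assumed aggregation choice, not confirmed here).
    """
    sims = l2_normalize(eeg_segs) @ l2_normalize(text_segs).T  # (n_eeg, n_txt)
    return sims.max()

def multi_instance_contrastive_loss(batch_eeg, batch_text, temperature=0.07):
    """Symmetric InfoNCE over a batch of (recording, report) pairs:
    matched pairs lie on the diagonal; all other pairs act as negatives."""
    n = len(batch_eeg)
    logits = np.array([[pair_similarity(e, t) for t in batch_text]
                       for e in batch_eeg]) / temperature       # (n, n)
    log_p_rows = logits - logsumexp(logits, axis=1)             # EEG -> report
    log_p_cols = logits - logsumexp(logits, axis=0)             # report -> EEG
    return -(np.trace(log_p_rows) + np.trace(log_p_cols)) / (2 * n)

# Toy usage: four recording/report pairs with varying segment counts.
rng = np.random.default_rng(0)
d = 32
batch_eeg = [rng.normal(size=(rng.integers(3, 8), d)) for _ in range(4)]
batch_text = [rng.normal(size=(rng.integers(2, 6), d)) for _ in range(4)]
print(multi_instance_contrastive_loss(batch_eeg, batch_text))
```

Because the per-pair score pools over all segment combinations before entering the contrastive loss, a recording and report can match on a few informative segments even when most segments are uninformative, which is the property the abstract's augmentation targets.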