This paper highlights that existing event-based Vision-Language Models (VLMs) focus solely on traditional CLIP-based perceptual tasks and therefore struggle to explicitly capture sufficient semantics and context from event streams. To overcome these limitations, we propose EventVL, the first generative event-based multimodal large language model (MLLM) framework for explicit semantic understanding. To bridge the data gap for cross-modal semantic connections, we annotate a large dataset of approximately 1.4 million high-quality event-image/video-text pairs. We then design an event spatiotemporal representation that dynamically aggregates and segments the event stream, fully exploiting its comprehensive spatial and temporal information. Furthermore, we introduce dynamic semantic alignment to enhance and complement the sparse semantic space of events. Experimental results demonstrate that EventVL significantly outperforms existing MLLM baselines on both event captioning and scene description generation.
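The abstract does not specify how the event spatiotemporal representation aggregates and segments the stream. As a minimal sketch of the general idea, the snippet below bins raw events into a standard voxel-grid encoding; the function name, signature, and bilinear temporal binning are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Aggregate an event stream into a spatiotemporal voxel grid.

    events: (N, 4) array of (t, x, y, polarity) with polarity in {-1, +1}.
    Returns a (num_bins, height, width) grid in which each temporal bin
    accumulates polarity-weighted event counts (bilinear in time).
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3].astype(np.float32)

    # Normalize timestamps to the range [0, num_bins - 1].
    t_norm = (num_bins - 1) * (t - t[0]) / max(float(t[-1] - t[0]), 1e-9)
    t0 = np.floor(t_norm).astype(int)
    dt = (t_norm - t0).astype(np.float32)

    # Split each event between its two nearest temporal bins.
    np.add.at(grid, (t0, y, x), p * (1.0 - dt))
    valid = t0 + 1 < num_bins
    np.add.at(grid, (t0[valid] + 1, y[valid], x[valid]), p[valid] * dt[valid])
    return grid
```

A representation of this kind preserves both the spatial layout and the temporal ordering of events, which is the property the abstract's "dynamically aggregate and segment" step appears to rely on.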