This paper presents a novel approach to electroencephalography (EEG) signal analysis, a task made difficult by limited data, high dimensionality, and the absence of models that can fully capture spatiotemporal dependencies. Unlike existing self-supervised learning (SSL) methods that focus on either spatial or temporal features, we propose EEG-VJEPA, a model that treats EEG as a video-like sequence and learns spatiotemporal representations. EEG-VJEPA applies the Video Joint Embedding Predictive Architecture (V-JEPA) to EEG classification, learning meaningful representations through joint embedding and adaptive masking. Experimental results on the TUH Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. Beyond accuracy, the model captures physiologically relevant spatiotemporal signal patterns and provides interpretable embeddings, demonstrating its potential to support human-AI collaboration in diagnostic workflows.
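To make the "EEG as video" framing concrete, the sketch below shows one plausible way to reshape a multichannel EEG recording into a video-like tensor and apply V-JEPA-style tubelet masking. This is a minimal illustration, not the authors' implementation: the 3x7 electrode grid, window length, tensor shapes, and mask ratio are all assumptions chosen for clarity.

```python
# Illustrative sketch only (not the authors' code): one plausible way to view
# multichannel EEG as a video-like tensor and apply V-JEPA-style tubelet
# masking. The 3x7 electrode grid, window sizes, and mask ratio are assumptions.
import torch

num_channels, num_samples = 21, 2500          # e.g., 21 electrodes, 10 s @ 250 Hz
eeg = torch.randn(num_channels, num_samples)  # stand-in for a real recording

# Lay electrodes out on a rough 2D scalp grid, then treat time as the frame
# axis so the signal has a video-like shape: (frames, height, width).
grid_h, grid_w = 3, 7
video_like = eeg.reshape(grid_h, grid_w, num_samples).permute(2, 0, 1)

# Tokenize into spatiotemporal "tubelets": non-overlapping windows along time,
# each covering the full grid here for simplicity. Each tubelet is one token.
clip_len = 50                                          # samples per tubelet
tubelets = video_like.unfold(0, clip_len, clip_len)    # (n_tokens, H, W, clip_len)
tokens = tubelets.flatten(start_dim=1)                 # (n_tokens, H*W*clip_len)

# Random masking: the context encoder sees only the unmasked tokens, and a
# predictor is trained to regress the target encoder's embeddings of the rest.
mask_ratio = 0.5
n_tokens = tokens.shape[0]
perm = torch.randperm(n_tokens)
n_masked = int(mask_ratio * n_tokens)
masked_idx, context_idx = perm[:n_masked], perm[n_masked:]
context_tokens = tokens[context_idx]                   # visible to the encoder
print(tokens.shape, context_tokens.shape)
```

In a full joint-embedding setup, the predictor's outputs for the masked positions would be compared against a slowly updated target encoder's embeddings of those same tubelets, so no pixel-level (here, sample-level) reconstruction is required.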