This paper presents a multimodal approach for detecting sexism in online video content, particularly on social media platforms such as TikTok and BitChute. We introduce MuSeD, a novel Spanish-language multimodal sexism detection dataset (approximately 11 hours of video), and propose an innovative annotation framework that analyzes the contributions of the textual, speech, and visual modalities. We evaluate a range of large language models (LLMs) and multimodal LLMs on the sexism detection task, finding that visual information plays a crucial role in labeling sexist content. While the models effectively detect explicit sexism, they struggle with implicit forms, such as stereotypes, for which inter-annotator agreement is also low. This underscores the inherent difficulty of identifying implicit sexism, which relies on social and cultural context.