This paper surveys recent research trends in video-based collision detection for intelligent transportation systems. With the advancement of large language models (LLMs) and vision-language models (VLMs), the way multimodal information is processed, reasoned over, and summarized is rapidly changing. We examine cutting-edge approaches that leverage LLMs for collision detection from video data. Specifically, we present a systematic classification of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss current challenges and opportunities, providing a foundation for future research in the rapidly growing interdisciplinary field of video understanding and foundation models.