This paper addresses the limitations of vision-language models (VLMs) in understanding spatiotemporal interactions. Existing VLMs struggle to reason about object motion, rotation, and viewpoint changes, capabilities that are essential for understanding dynamic real-world situations. We therefore present VLM4D, a novel benchmark for evaluating the spatiotemporal reasoning capabilities of VLMs. VLM4D consists of diverse real-world and synthetic videos with carefully constructed question-answer pairs emphasizing translational and rotational motion, viewpoint awareness, and motion continuity. A comprehensive evaluation of state-of-the-art VLMs reveals significant performance gaps relative to human baselines, highlighting fundamental deficiencies in existing models. Our analysis shows that VLMs struggle to integrate multiple visual cues and to maintain temporal coherence. We also explore promising directions for improvement, including 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, and demonstrate their effectiveness in enhancing spatiotemporal understanding. This work aims to encourage further exploration of spatial and temporal grounding in VLMs, toward more capable and reliable visual intelligence for dynamic environments.