This paper investigates how accurately multimodal large language models (MLLMs) identify the orientation of images rotated by 0°, 90°, 180°, and 270°. To this end, we present RotBench, a manually filtered benchmark of 350 lifestyle, portrait, and landscape images. We evaluate state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, and show that they fail to reliably identify image rotation. Providing auxiliary information, such as captions or depth maps, or using chain-of-thought prompting yields only marginal improvements. Most models correctly identify 0° images, and some identify 180° images, but none reliably distinguishes 90° from 270°. Presenting the image in multiple orientations simultaneously and aggregating predictions by voting improves performance, whereas fine-tuning improves 180° identification but does not help discriminate 90° from 270°. In conclusion, these results reveal a significant gap between the spatial reasoning abilities of MLLMs and human perception.
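As a minimal sketch of the multi-orientation voting setup described above, the snippet below shows one way to present an image in all four orientations and take a majority vote over the implied upright angle. The function `query_model` is a hypothetical placeholder for an MLLM call and is not part of the paper; the actual prompts and aggregation used in the experiments may differ.

```python
from collections import Counter
from PIL import Image

ANGLES = [0, 90, 180, 270]


def query_model(image: Image.Image) -> int:
    """Hypothetical MLLM call: returns the model's predicted rotation angle
    of `image` (one of 0, 90, 180, 270). Replace with a real API call."""
    raise NotImplementedError


def predict_rotation_by_voting(path: str) -> int:
    """Show the same image in all four orientations and vote.

    If the model predicts angle `a` for the copy that we additionally
    rotated by `r`, its implied estimate of the original rotation is
    (a - r) mod 360. The majority estimate across the four presentations
    is returned.
    """
    original = Image.open(path)
    votes = []
    for r in ANGLES:
        # PIL rotates counter-clockwise; expand=True keeps the full frame.
        rotated = original.rotate(r, expand=True)
        predicted = query_model(rotated)
        votes.append((predicted - r) % 360)
    return Counter(votes).most_common(1)[0][0]
```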