Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

Created by
  • Haebom

Authors

Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

Outline

This paper investigates how accurately multimodal large language models (MLLMs) identify the orientation of images rotated by 0°, 90°, 180°, or 270°. To this end, the authors present RotBench, a manually filtered benchmark of 350 images spanning lifestyle, portrait, and landscape images. They evaluate state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, and show that these models fail to reliably identify image rotation. Providing auxiliary information such as captions or depth maps, or using chain-of-thought prompting, yields only marginal improvements. Most models can identify upright (0°) images, and some can identify upside-down (180°) images, but they cannot distinguish between 90° and 270°. Presenting the image in several orientations simultaneously and using a voting scheme both improve performance. Fine-tuning, however, improves identification of 180° images but not discrimination between 90° and 270°. The authors conclude that a significant gap remains between the spatial reasoning abilities of MLLMs and human perception.
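The per-angle findings above (strong on 0°, mixed on 180°, confusion between 90° and 270°) only become visible when accuracy is broken down by ground-truth rotation instead of being averaged. The following is a minimal sketch of such a breakdown, not the authors' evaluation code. The function name `per_angle_report` and the toy prediction pairs are hypothetical and purely illustrative; they are not results from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

ANGLES = (0, 90, 180, 270)


def per_angle_report(
    pairs: Iterable[Tuple[int, int]],
) -> Tuple[Dict[int, float], Dict[int, Counter]]:
    """Return per-angle accuracy and a confusion table.

    `pairs` holds (true_angle, predicted_angle) tuples, one per evaluated image.
    """
    # confusion[truth][prediction] counts how often each answer was given.
    confusion: Dict[int, Counter] = defaultdict(Counter)
    for truth, pred in pairs:
        confusion[truth][pred] += 1

    accuracy = {}
    for angle in ANGLES:
        total = sum(confusion[angle].values())
        # Angles with no examples get NaN instead of a misleading 0.0.
        accuracy[angle] = confusion[angle][angle] / total if total else float("nan")
    return accuracy, dict(confusion)


# Synthetic placeholder data (assumption): NOT taken from RotBench.
toy_pairs = [
    (0, 0), (0, 0),
    (90, 90), (90, 270),
    (180, 180), (180, 0),
    (270, 90), (270, 270),
]
accuracy, confusion = per_angle_report(toy_pairs)
```

Reading the off-diagonal entries of `confusion` (for example, how often a 270° image is answered as 90°) is what distinguishes a systematic 90°/270° mix-up from random guessing.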

Takeaways, Limitations

Takeaways: The paper clearly demonstrates the limits of MLLMs' spatial reasoning, particularly in identifying image rotation. It offers insight into how much auxiliary information and prompt engineering actually help. It also suggests strategies that improve performance, such as presenting the image in several orientations simultaneously or aggregating predictions by voting.
Limitations: RotBench is relatively small. The set of MLLMs evaluated may be limited. The benchmark may be biased toward certain types of images. Further research is needed on distinguishing 90° from 270° rotations.
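The voting strategy mentioned in the takeaways can be illustrated with a small sketch. The summary does not specify the exact procedure, so the following is one plausible reading and not the authors' method: rotate the input by each of the four angles, ask the model for the orientation of each view, subtract the added rotation, and take the majority answer. The counter-clockwise convention, the grid stand-in for an image, and the names `rotate_ccw`, `vote_rotation`, and `make_biased_predictor` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List

ANGLES = (0, 90, 180, 270)
Grid = List[List[int]]  # stand-in for an image; a real pipeline would use image objects


def rotate_ccw(grid: Grid, angle: int) -> Grid:
    """Rotate a 2D grid counter-clockwise by a multiple of 90 degrees."""
    for _ in range((angle // 90) % 4):
        # Transposing and then reversing the row order is a 90-degree CCW turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def vote_rotation(image: Grid, predict: Callable[[Grid], int]) -> int:
    """Estimate the rotation of `image` by majority vote over four re-rotated views."""
    votes = Counter()
    for extra in ANGLES:
        guess = predict(rotate_ccw(image, extra))
        # Undo the extra rotation so every vote refers to the original input.
        votes[(guess - extra) % 360] += 1
    return votes.most_common(1)[0][0]


def make_biased_predictor(upright: Grid) -> Callable[[Grid], int]:
    """Hypothetical mock model: right on 0 and 180, always answers 90 for 90 or 270."""
    def predict(view: Grid) -> int:
        truth = next(a for a in ANGLES if rotate_ccw(upright, a) == view)
        return truth if truth in (0, 180) else 90
    return predict


upright = [[1, 2], [3, 4]]
predict = make_biased_predictor(upright)
estimate = vote_rotation(rotate_ccw(upright, 270), predict)
```

With this mock, a single query on the 270° input returns 90, while the vote returns 270, because three of the four re-rotated views land on orientations the mock handles correctly. That outcome depends on the mock's errors being perfectly consistent; a real model's mistakes may not cancel out this cleanly.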