Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Created by
  • Haebom

Author

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

Outline

VisualOverload is a new Visual Question Answering (VQA) benchmark designed to test whether state-of-the-art vision-language models (VLMs) truly handle basic visual understanding. It consists of 2,720 question-answer pairs built on high-resolution, public-domain images with densely detailed backgrounds and multiple characters, actions, and subplots. The benchmark challenges VLMs with simple, knowledge-free visual tasks in these dense scenes, under the hypothesis that current benchmarks may overestimate VLM performance. Tests show that even the best models achieve low accuracy on VisualOverload, indicating that encoding and reasoning over fine details remain challenging, and error analysis uncovers several distinct failure modes.
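For context, scoring a benchmark of this kind typically means iterating over the question-answer pairs, querying the model with each image and question, and computing exact-match accuracy. The sketch below is a minimal, hypothetical illustration of that loop; the field names ("image", "question", "answer") and the `query_model` function are placeholders, not the authors' released evaluation code.

```python
# Minimal, hypothetical sketch of exact-match accuracy on a VQA benchmark
# such as VisualOverload. Field names and query_model() are assumptions,
# not the authors' released API.

def query_model(image, question: str) -> str:
    """Placeholder: call your VLM here and return its textual answer."""
    raise NotImplementedError

def normalize(text: str) -> str:
    """Lowercase and strip whitespace/trailing periods so 'Three.' matches 'three'."""
    return text.strip().lower().rstrip(".")

def evaluate(samples) -> float:
    """Exact-match accuracy over an iterable of {image, question, answer} samples."""
    correct = 0
    total = 0
    for sample in samples:
        prediction = query_model(sample["image"], sample["question"])
        correct += normalize(prediction) == normalize(sample["answer"])
        total += 1
    return correct / total
```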

Takeaways, Limitations

Takeaways:
  • Current VLMs struggle to understand details in complex, dense scenes.
  • VisualOverload exposes the weaknesses of VLMs and provides a valuable resource for developing better models.
  • This suggests that existing VQA benchmarks may overestimate the actual performance of VLMs.
Limitations:
  • Models fail to maintain logical consistency in counting, OCR, and other complex tasks.
  • Even the best-performing of the 37 models tested achieved only low accuracy.