Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HueManity: Probing Fine-Grained Visual Perception in MLLMs

Created by
  • Haebom

Authors

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

Outline

This paper addresses the limitations of multimodal large language models (MLLMs) on fine-grained perceptual tasks. It introduces HueManity, a new benchmark of 83,850 images that embed two-character alphanumeric strings in Ishihara-style dot patterns. Nine state-of-the-art MLLMs were evaluated on HueManity, and all performed far below both human participants and a conventional computer-vision baseline. The best-performing MLLM achieved 33.6% accuracy on the "easy" digit-based task and only 3% on the "hard" alphanumeric task, whereas humans scored near-perfectly (100% and 95.6%) and a fine-tuned ResNet50 reached 96.5% and 94.5%, respectively. This gap exposes a critical weakness in the visual capabilities of current MLLMs. The paper also analyzes architectural and training-paradigm factors that may contribute to this perceptual gap, and releases the HueManity dataset and code to foster further research on improving the perceptual robustness of MLLMs.
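To make the task concrete, here is a minimal sketch (not the authors' released generator) of how an Ishihara-style stimulus of this kind can be produced: a two-character string is rasterized to a binary mask, and randomly scattered dots are colored by hue depending on whether they fall inside a glyph stroke, so the string is legible only through color. The font path, colors, dot counts, and sizes are illustrative assumptions; the sketch assumes Pillow and NumPy are installed.

```python
# Minimal sketch of an Ishihara-style stimulus generator (illustrative,
# not the HueManity pipeline). Dots inside glyph strokes take one hue,
# background dots another; luminance is kept similar so only color cues
# reveal the string.
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def ishihara_style_image(text="4B", size=512, n_dots=1800,
                         fg=(46, 139, 87), bg=(188, 143, 143),
                         font_path="DejaVuSans-Bold.ttf"):
    # Rasterize the two-character string to a binary mask.
    # The font path is an assumption; point it at any bold TTF.
    mask = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(mask)
    font = ImageFont.truetype(font_path, int(size * 0.55))
    draw.text((size * 0.08, size * 0.15), text, fill=255, font=font)
    mask_arr = np.array(mask) > 0  # True where a glyph stroke is

    # Scatter random dots; color each by whether its center lies
    # inside a glyph stroke (foreground hue) or outside (background hue).
    canvas = Image.new("RGB", (size, size), (240, 240, 230))
    dots = ImageDraw.Draw(canvas)
    for _ in range(n_dots):
        x, y = random.randrange(size), random.randrange(size)
        r = random.randint(3, 7)
        color = fg if mask_arr[y, x] else bg  # NumPy indexes (row, col)
        dots.ellipse((x - r, y - r, x + r, y + r), fill=color)
    return canvas

if __name__ == "__main__":
    ishihara_style_image("4B").save("stimulus.png")
```

Scoring a model on such stimuli then reduces to an exact-match comparison between its transcription and the ground-truth two-character string, which is the kind of accuracy reported above.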

Takeaways, Limitations

Takeaways:
  • Clearly demonstrates that, despite their strong high-level visual reasoning, current MLLMs have marked limitations on fine-grained perceptual tasks.
  • Suggests research directions for improving the visual perception abilities of MLLMs.
  • Supports follow-up MLLM research by releasing the HueManity dataset and code.
Limitations:
  • The HueManity benchmark focuses on one specific type of visual task and may not fully assess the overall visual capabilities of MLLMs.
  • The diversity of architectures and training paradigms among the evaluated MLLMs may be limited.