Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CountQA: How Well Do MLLMs Count in the Wild?

Created by
  • Haebom

Authors

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, Sahiti Yerramilli

Outline

This paper addresses the insufficient object counting capabilities of multimodal large language models (MLLMs). It highlights the limitations of existing benchmarks (low object density and limited visual domains) and proposes CountQA, a new benchmark for evaluating the object counting performance of MLLMs under realistic conditions. CountQA consists of over 1,500 question-answer pairs built on real-world images with high object density, clutter, and occlusion. Evaluating 15 leading MLLMs on CountQA reveals that the best-performing model reaches only 42.9% accuracy, and that performance degrades as the number of objects increases. CountQA provides a dedicated benchmark for diagnosing and improving the object counting capabilities of MLLMs, laying the foundation for next-generation MLLMs that are not only technically fluent but also numerically accurate and spatially aware.
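
The evaluation described above reduces to exact-match accuracy over question-answer pairs: the model's predicted count is compared against the ground-truth count for each image. Below is a minimal sketch of such an evaluation loop; the JSONL field names and the `ask_model` callable are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch: exact-match accuracy on counting QA pairs.
# Assumes a JSONL file where each record has "image_path", "question",
# and an integer "answer"; `ask_model` stands in for any MLLM call.
import json
import re


def parse_count(text: str):
    """Extract the first integer from a model's free-form reply, if any."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


def evaluate(jsonl_path: str, ask_model) -> float:
    """Return exact-match counting accuracy over all QA pairs."""
    correct, total = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            item = json.loads(line)
            reply = ask_model(item["image_path"], item["question"])
            if parse_count(reply) == item["answer"]:
                correct += 1
            total += 1
    return correct / total if total else 0.0
```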

Takeaways, Limitations

Takeaways:
CountQA is a new benchmark that clearly exposes the inadequacy of MLLMs' object counting abilities under realistic conditions.
It suggests research directions for improving the object counting performance of MLLMs.
The release of the CountQA dataset and code facilitates further research.
Limitations:
The CountQA benchmark is still in its early stages and needs to be expanded to cover more visual situations and object types.
Because the MLLMs evaluated so far perform relatively poorly, the performance of more advanced MLLMs will need to be monitored as they improve.