This paper addresses the limitations of Multimodal Large Language Models (MLLMs) on fine-grained visual perception tasks. We present HueManity, a new benchmark of 83,850 images in which two-character alphanumeric strings are embedded in Ishihara-style dot patterns. We evaluate nine state-of-the-art MLLMs on HueManity and find that their performance falls far below that of humans and traditional computer vision baselines. The best-performing MLLM achieves 33.6% accuracy on the "easy" digit-based task and only 3% on the "hard" alphanumeric task, whereas human participants score near-perfectly (100% and 95.6%) and a fine-tuned ResNet50 model reaches 96.5% and 94.5% accuracy, respectively. These results expose a significant gap in the visual capabilities of current MLLMs. We further analyze architectural and training-paradigm factors that may contribute to this perceptual gap, and we release the HueManity dataset and code to foster further research on improving the perceptual robustness of MLLMs.