This paper highlights that, despite the rapid advancement of multimodal agents built on large-scale foundation models, the potential of language-based communication between agents in collaborative tasks has been largely overlooked. This oversight leaves a critical gap in understanding their effectiveness in real-world deployments, particularly in communication with human partners. Existing agent benchmarks fail to address key aspects of inter-agent communication and collaboration, especially scenarios where agents have unequal access to information and must work together to accomplish tasks beyond their individual capabilities. To bridge this gap, the paper presents COMMA, a novel puzzle-based benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. By offering a variety of multimodal puzzles, COMMA provides a comprehensive assessment of four key categories of agent capability in communicative collaboration settings. The results reveal surprising weaknesses in state-of-the-art models, including powerful proprietary and reasoning models such as GPT-4o and o4-mini. Many chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, underperform a random baseline in inter-agent collaboration, indicating substantial room for improvement in their communication capabilities.