Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

COMMA: A Communicative Multimodal Multi-Agent Benchmark

Created by
  • Haebom

Author

Timothy Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu

Outline

This paper highlights that despite the rapid advancement of multimodal agents built on large foundation models, the potential of language-based communication between agents in collaborative tasks has been largely overlooked, leaving a critical gap in understanding how effective such communication is in real-world deployments, where it mirrors human-to-human communication. Existing agent benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly scenarios where agents have unequal access to information and must work together to accomplish tasks beyond their individual capabilities. To bridge this gap, the paper presents COMMA, a novel puzzle-based benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Using a variety of multimodal puzzles, COMMA provides a comprehensive assessment of four key categories of agent capability in communicative collaboration settings. The results reveal surprising weaknesses in state-of-the-art models, including powerful proprietary and reasoning models such as GPT-4o and o4-mini. Many chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, underperform a random baseline in inter-agent collaboration, indicating substantial room for improving their communication capabilities.
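The asymmetric-information setup described above can be illustrated with a minimal sketch (not code from the paper; all names and the toy puzzle are illustrative): a "solver" agent observes the puzzle but lacks the manual, while an "expert" agent holds the manual but cannot act, so the two must exchange natural-language messages to succeed.

```python
# Illustrative sketch of an asymmetric-information collaboration episode,
# in the spirit of the COMMA setup. The wire-and-manual puzzle, the agent
# roles, and all function names are hypothetical.

MANUAL = {"red": "press", "blue": "ignore"}  # visible only to the expert


def expert_reply(question: str) -> str:
    """Expert answers using the manual that only it can see."""
    for color, action in MANUAL.items():
        if color in question:
            return f"If the wire is {color}, {action} it."
    return "Please tell me the wire's color."


def solver_turn(observation: str) -> str:
    """Solver describes its private observation and asks for guidance."""
    return f"I see a {observation} wire. What should I do?"


def run_episode(observation: str) -> str:
    """One communication round: solver asks, expert answers, solver acts."""
    question = solver_turn(observation)
    answer = expert_reply(question)
    return "press" if "press" in answer else "ignore"
```

Neither agent can solve the task alone: success hinges entirely on the quality of the exchanged messages, which is the capability the benchmark isolates.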

Takeaways, Limitations

Takeaways: We present a new benchmark (COMMA) for evaluating the collaborative language communication capabilities of multimodal multi-agent systems. It exposes vulnerabilities in the inter-agent collaboration capabilities of state-of-the-art models and suggests future research directions. In particular, it highlights the need to improve the communication capabilities of chain-of-thought reasoning models.
Limitations: Further research is needed to establish the generalizability of the COMMA benchmark itself and its applicability to a wider range of collaboration scenarios. Additional experiments are needed on models beyond those currently evaluated. The puzzle tasks may not fully reflect complex real-world collaboration.