Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Created by
  • Haebom

Author

Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che

Outline

This paper tackles complex real-world problems that are difficult for a single-agent system by using a multi-agent system (MAS) built on large language models (LLMs). Recent advances in test-time scaling (TTS) have substantially improved single-agent performance on challenging reasoning tasks, but effectively scaling collaboration and reasoning in a MAS remains an open challenge. This work presents an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. The authors construct M500, a high-quality dataset of 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on it to obtain M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, they propose a novel CEO agent that guides the collaboration and adjusts reasoning depth for more effective problem solving. Evaluated in an open-source MAS across a variety of tasks, including general understanding, mathematical reasoning, and coding, the proposed system significantly outperforms strong baselines: M1-32B achieves improvements of 12% on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching state-of-the-art models such as DeepSeek-R1 on some tasks. These results highlight the importance of learned collaboration and adaptive coordination in scaling multi-agent reasoning. The code is available at https://github.com/jincan333/MAS-TTS .
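The paper itself does not publish the CEO agent's internals in this summary, but the idea of an agent that manages the discussion and adapts reasoning depth can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the authors' implementation: the `solve` loop, the majority-vote stopping rule, and the toy agents are all assumptions made for clarity.

```python
# Hypothetical sketch of CEO-style coordination (NOT the paper's actual code).
# A "CEO" policy decides how many discussion rounds (reasoning depth) to run,
# stopping early once the worker agents' answers reach a majority agreement.

from collections import Counter

def solve(problem, agents, ceo_max_rounds=4):
    """Run a multi-agent discussion moderated by a simple CEO stopping policy."""
    transcript = []  # shared history each agent can condition on
    best = None
    for round_idx in range(ceo_max_rounds):
        answers = [agent(problem, transcript) for agent in agents]
        transcript.append(answers)
        # CEO decision: stop as soon as a strict majority of agents agree.
        best, count = Counter(answers).most_common(1)[0]
        if count > len(agents) // 2:
            return best, round_idx + 1
    return best, ceo_max_rounds

# Toy agents: they disagree at first, then converge after seeing the transcript.
stubborn = lambda p, t: "A"
follower = lambda p, t: "A" if t else "B"
holdout  = lambda p, t: "A" if t else "C"

answer, rounds = solve("toy question", [stubborn, follower, holdout])
# First round has no majority ("A", "B", "C"); the CEO runs a second round,
# where all agents agree on "A", so the discussion stops at depth 2.
```

The point of the sketch is only the control flow: the coordinator spends more rounds (deeper reasoning) on questions where the agents disagree and cuts the discussion short on easy ones.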

Takeaways, Limitations

Takeaways:
An adaptive multi-agent framework that integrates model-level training and system-level coordination is shown to be effective for solving complex reasoning problems.
Dynamic management of the collaboration and adjustment of reasoning depth by the CEO agent contribute to the performance gains.
Achieves performance competitive with state-of-the-art models across a wide range of tasks.
Releases M500, a high-quality multi-agent collaborative reasoning dataset.
Limitations:
Further research is needed on the framework's generalization performance.
Evaluation on more complex and diverse tasks is needed.
The CEO agent's decision-making process needs greater transparency and explainability.
The M500 dataset may need to be expanded further.