This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Created by
Haebom
Authors
Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
Overview
The LiveMCP-101 benchmark evaluates how well AI agents solve complex, multi-step tasks using Model Context Protocol (MCP) tools. It consists of 101 real-world queries, each requiring coordinated use of multiple MCP tools, including web search, file operations, mathematical reasoning, and data analysis. Rather than comparing against raw API outputs, it evaluates agents against ground-truth execution plans, which better reflects the dynamic nature of real-world environments. Experiments show that even state-of-the-art LLMs achieve success rates below 60% and exhibit a range of failure modes, along with inefficient token usage. These results highlight the difficulty of tool orchestration and suggest directions for improving models.
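The motivation for plan-based evaluation can be illustrated with a small sketch. This is not the paper's harness: the tool names (`fetch_price`, `multiply`), the plan format, and the exact-match comparison are all assumptions made up for illustration. It only shows why a reference output cached at authoring time can go stale when tools return live data, while re-running a reference plan at evaluation time keeps the reference current.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a tool receives its arguments and the prior step results.
Tool = Callable[[dict, list], object]
Step = Tuple[str, dict]


def run_plan(plan: List[Step], tools: Dict[str, Tool]) -> object:
    """Execute a plan step by step and return the result of the last step."""
    results: list = []
    for name, args in plan:
        results.append(tools[name](args, results))
    return results[-1]


def make_tools(live_price: float) -> Dict[str, Tool]:
    """Build a toy tool set whose 'live' data is controlled by live_price."""
    return {
        "fetch_price": lambda args, prev: live_price,          # stands in for a live API
        "multiply": lambda args, prev: prev[-1] * args["factor"],
    }


# Illustrative reference plan: fetch a live value, then scale it.
reference_plan: List[Step] = [
    ("fetch_price", {"symbol": "XYZ"}),
    ("multiply", {"factor": 3}),
]

# Reference output cached when the task was authored (price was 10.0).
cached_output = run_plan(reference_plan, make_tools(live_price=10.0))

# At evaluation time the environment has drifted (price is now 12.0).
tools_now = make_tools(live_price=12.0)
agent_answer = run_plan(reference_plan, tools_now)      # an agent that behaves correctly
fresh_reference = run_plan(reference_plan, tools_now)   # reference plan re-run live

stale_check = agent_answer == cached_output   # penalizes a correct agent
plan_check = agent_answer == fresh_reference  # accepts a correct agent
print(stale_check, plan_check)
```

In this toy setting the cached comparison rejects a correct agent because the underlying data changed, whereas the plan-based comparison accepts it. A real evaluation would need a more tolerant comparison than exact equality.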
Takeaways, Limitations
• Takeaways:
  ◦ It provides rigorous criteria for evaluating how well agents perform complex tasks with multiple tools in real-world environments.
  ◦ It clearly exposes the limits of tool orchestration in cutting-edge LLMs.
  ◦ It analyzes the failure modes and inefficiencies that arise during tool use and suggests directions for model improvement.
  ◦ It points to important directions for developing autonomous AI systems.
• Limitations:
  ◦ The benchmark's size (101 queries) may be relatively limited.
  ◦ It may not fully reflect the variety of real-world situations.
  ◦ The evaluation method may need refinement, and a more diverse set of tools may need to be integrated.