Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Created by
  • Haebom

Authors

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Overview

This paper identifies flaws in the benchmarks used to evaluate AI agents and presents the Agentic Benchmark Checklist (ABC), a set of guidelines for addressing them. It shows that problems in task setup or reward design can cause existing agentic benchmarks to under- or overestimate agent performance by up to 100%. For example, SWE-bench Verified uses insufficient test cases, and TAU-bench counts empty responses as successful. ABC was synthesized from benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a complex evaluation design, it reduced performance overestimation by 33%.
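To make the reward-design problem concrete, here is a minimal, hypothetical sketch of how an outcome check can count an empty response as a success, and how much that inflates a measured success rate. It is not taken from TAU-bench, ABC, or the paper: the function names, the forbidden-string rule, and the toy responses are all assumptions chosen for illustration.

```python
def lenient_check(response: str, forbidden: list[str]) -> bool:
    """Hypothetical check: pass if no forbidden string appears in the response."""
    return not any(term in response for term in forbidden)


def strict_check(response: str, forbidden: list[str]) -> bool:
    """Same rule, but a blank response is never counted as a success."""
    if not response.strip():
        return False
    return lenient_check(response, forbidden)


def relative_overestimation(measured: float, corrected: float) -> float:
    """Relative gap between a measured success rate and the corrected rate."""
    return (measured - corrected) / corrected


# Toy data (assumed): one empty response, two that violate the rule, one valid.
responses = ["", "refund issued", "refund issued", "cannot help with that"]
forbidden = ["refund issued"]

lenient_rate = sum(lenient_check(r, forbidden) for r in responses) / len(responses)
strict_rate = sum(strict_check(r, forbidden) for r in responses) / len(responses)

print(f"lenient success rate: {lenient_rate:.2f}")
print(f"strict success rate:  {strict_rate:.2f}")
print(f"relative overestimation: {relative_overestimation(lenient_rate, strict_rate):.0%}")
```

In this toy setup the empty response passes the lenient check simply because it contains nothing forbidden, so an agent that does nothing is rewarded. Rejecting blank responses is only one possible guard; a real benchmark would need checks suited to its own task definition.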

Takeaways, Limitations

Takeaways:
The paper provides systematic guidelines (ABC) for ensuring the reliability of AI agent benchmarks.
It identifies problems in the design and evaluation methods of existing benchmarks and suggests directions for improvement.
ABC can be used to make performance evaluations of AI agents more accurate.
Limitations:
Further validation is needed to determine whether ABC is applicable to all types of agent benchmarks.
The process of applying ABC can be complex and time-consuming.
Further research is needed into the completeness and objectivity of ABC itself.