Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Created by
  • Haebom

Authors

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Outline

This paper identifies problems in the agentic benchmarks used to evaluate AI agents and introduces the Agentic Benchmark Checklist (ABC), a set of guidelines for addressing them. Many existing agentic benchmarks have issues in task setup or reward design that can cause agent performance to be under- or overestimated by up to 100% in relative terms. For example, SWE-bench Verified uses insufficient test cases, and TAU-bench counts empty responses as successful. ABC was synthesized from the authors' benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduced performance overestimation by 33%.
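To make the reward-design problem concrete, here is a minimal sketch of how a grader that accepts empty responses can inflate a measured success rate. Everything in it is an assumption made for illustration: the function names (lenient_grader, strict_grader, success_rate, relative_overestimation), the toy run data, and the grading logic are hypothetical and are not taken from TAU-bench, ABC, or the paper.

```python
from typing import Callable, List, Tuple

# (agent response, reference answer) pairs. Toy data, invented for illustration.
Run = Tuple[str, str]


def lenient_grader(response: str, expected: str) -> bool:
    """Hypothetical flawed check: pass if every response token appears in the reference.

    An empty response has no tokens, so all() is vacuously True and the
    empty answer is counted as a success.
    """
    expected_tokens = set(expected.split())
    return all(token in expected_tokens for token in response.split())


def strict_grader(response: str, expected: str) -> bool:
    """Hypothetical stricter check: reject empty responses, then require an
    exact match after whitespace normalization."""
    normalized = " ".join(response.split())
    return bool(normalized) and normalized == " ".join(expected.split())


def success_rate(grader: Callable[[str, str], bool], runs: List[Run]) -> float:
    """Fraction of runs that the given grader marks as successful."""
    return sum(grader(response, expected) for response, expected in runs) / len(runs)


def relative_overestimation(measured: float, reference: float) -> float:
    """How far the measured rate exceeds the reference rate, relative to the reference."""
    return (measured - reference) / reference


RUNS: List[Run] = [
    ("refund issued", "refund issued"),
    ("", "order cancelled"),    # agent returned nothing
    ("", "address updated"),    # agent returned nothing
    ("flight rebooked", "flight rebooked"),
]

lenient = success_rate(lenient_grader, RUNS)
strict = success_rate(strict_grader, RUNS)
overestimate = relative_overestimation(lenient, strict)

print(f"lenient={lenient:.2f} strict={strict:.2f} relative overestimation={overestimate:.0%}")
# prints "lenient=1.00 strict=0.50 relative overestimation=100%"
```

In this toy setup the lenient grader reports a success rate twice as high as the strict one, which is what a 100% relative overestimation means. The numbers come from the invented data above, not from any measurement reported in the paper.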

Takeaways, Limitations

Takeaways:
The paper provides a checklist (ABC) that can help improve the reliability of AI agent benchmarks.
It identifies concrete problems in the design and evaluation methods of existing benchmarks.
It presents ways to minimize errors that can arise when designing a benchmark and evaluating agents with it.
ABC can be used to improve the accuracy of AI agent performance evaluation.
Limitations:
Further validation is needed to determine whether ABC applies to all types of agentic benchmarks.
Applying ABC may require additional cost or time, which should be taken into account.
Further research is needed on how comprehensive the checklist is and how effective it is in practice.