Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Created by
  • Haebom

Authors

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Outline

This paper identifies problems in the agentic benchmarks used to evaluate AI agents and proposes the Agentic Benchmark Checklist (ABC), a set of guidelines for addressing them. The authors show that many existing agentic benchmarks have flaws in task setup or reward design that can cause agent performance to be underestimated or overestimated by up to 100% in relative terms. For example, SWE-bench Verified uses insufficient test cases, and TAU-bench counts empty responses as successful. ABC was synthesized from the authors' benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduced performance overestimation by 33%.
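To make the reward-design problem concrete, here is a minimal toy sketch of how a grader can count an empty response as a success, and how a do-nothing "null agent" exposes it. Everything in it is an assumption for illustration: the task list, `flawed_grader`, `stricter_grader`, `null_agent`, and `relative_overestimation` are hypothetical names, not code from the paper, ABC, or TAU-bench, and the relative-error formula is one common convention rather than the paper's stated definition.

```python
from typing import Callable, List

# Hypothetical toy tasks: each task lists strings the response must contain.
# An empty list stands for a task with nothing the grader can check.
TASKS: List[List[str]] = [
    ["refund issued"],
    ["flight rebooked"],
    [],
    [],
]

def flawed_grader(response: str, required: List[str]) -> bool:
    # all() over an empty list is True, so any response, even an empty one,
    # passes every task that has no required strings.
    return all(item in response for item in required)

def stricter_grader(response: str, required: List[str]) -> bool:
    # Illustrative fix only: additionally reject blank responses.
    return bool(response.strip()) and all(item in response for item in required)

def null_agent(task: List[str]) -> str:
    # A trivial agent that does nothing; a sound benchmark should score it near 0.
    return ""

def pass_rate(agent: Callable[[List[str]], str],
              grader: Callable[[str, List[str]], bool]) -> float:
    results = [grader(agent(task), task) for task in TASKS]
    return sum(results) / len(results)

def relative_overestimation(reported: float, corrected: float) -> float:
    # Assumed convention: error measured relative to the corrected score.
    return (reported - corrected) / corrected

print(pass_rate(null_agent, flawed_grader))    # 0.5
print(pass_rate(null_agent, stricter_grader))  # 0.0
print(relative_overestimation(0.5, 0.25))      # 1.0, i.e. 100%
```

Under the flawed grader the null agent "solves" half of the toy tasks without doing anything, which is the kind of signal a sanity check with a trivial agent is meant to surface. Rejecting blank responses is only a stand-in for a real fix; a rigorous grader would need to verify the actual task outcome.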

Takeaways, Limitations

Takeaways:
The paper presents a systematic set of guidelines (ABC) for making AI agent benchmarks more reliable.
It exposes problems in the design and evaluation methods of existing benchmarks and shows how severe the resulting measurement errors can be.
Applying ABC can improve benchmark reliability and make evaluations of AI agent performance more accurate.
Limitations:
Further validation is needed to determine whether ABC is applicable to all types of agent benchmarks.
The process of applying ABC can be complex and time-consuming.
Further research may be needed to assess the completeness and objectivity of ABC itself.