
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Created by
  • Haebom

Author

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Outline

This paper identifies problems with existing benchmarks used to evaluate AI agents and proposes a new guideline, the Agentic Benchmark Checklist (ABC), to address them. The authors show that existing benchmarks can under- or overestimate agent performance because of flaws in task setup or reward design: for example, SWE-bench Verified uses insufficient test cases, and TAU-bench counts empty responses as successful. ABC is synthesized from the authors' benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces performance overestimation by 33%.
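To make the reward-design failure mode concrete, here is a minimal, hypothetical sketch (not code from TAU-bench or ABC) contrasting a naive grader that counts an empty response as success with a stricter check in the spirit of the checklist. The function names and the toy database state are illustrative assumptions, not drawn from the actual benchmarks.

```python
# Hypothetical illustration of a reward-design flaw: when a task requires no
# state change, a grader that only compares end states rewards an agent that
# does nothing at all, inflating its measured performance.

def naive_grade(agent_response: str, expected_state: dict, final_state: dict) -> bool:
    # Counts an empty response as success whenever the end state already matches.
    return final_state == expected_state

def stricter_grade(agent_response: str, expected_state: dict, final_state: dict) -> bool:
    # ABC-style fix: also require a substantive response, so doing nothing
    # is no longer rewarded.
    return bool(agent_response.strip()) and final_state == expected_state

if __name__ == "__main__":
    db = {"order_123": "active"}           # toy task: the order should remain active
    print(naive_grade("", db, db))         # True  -> overestimates the agent
    print(stricter_grade("", db, db))      # False -> empty reply is not rewarded
```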

Takeaways, Limitations

Takeaways: ABC offers systematic guidelines for improving the reliability of agentic benchmarks, which can make AI agent performance evaluation more accurate. It can also help identify and fix design and evaluation problems in existing benchmarks.
Limitations: Additional validation is needed to determine whether ABC applies to all types of agent benchmarks. Applying ABC may increase the complexity of benchmark development. Not all ABC items carry the same importance for every benchmark.