Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Created by
  • Haebom

Authors

Spandan Garg, Benjamin Steenhoek, Yufan Huang

Outline

We raise the issue that current benchmarks for evaluating software engineering agents (e.g., SWE-Bench Verified) do not adequately reflect how developers actually interact with agents in real-world IDE environments, which leads to an overestimation of agent capabilities. This paper proposes a novel framework that mutates existing benchmark tasks into realistic user queries. We apply this framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and SWE-Bench C#, and observe that agent performance degrades on the mutated tasks. This research presents a new paradigm for evaluating conversational, chat-based software engineering agents.
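
The paper's exact mutation procedure is not reproduced here; the following is a minimal sketch of what a benchmark-mutation step could look like, assuming a SWE-Bench-style task dictionary with a "problem_statement" field and a generic LLM callable. The prompt wording, field names, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of benchmark mutation (assumed interface, not the paper's code).
from typing import Callable, Dict

# Hypothetical prompt: rewrite a formal GitHub issue as the kind of short,
# informal request a developer might type into an IDE chat assistant.
MUTATION_PROMPT = """Rewrite the following GitHub issue as a short, informal
request a developer might type to an IDE chat assistant. Drop stack traces
and repo-specific boilerplate; keep only what a real user would plausibly say.

Issue:
{issue}

User query:"""

def mutate_task(task: Dict[str, str], llm: Callable[[str], str]) -> Dict[str, str]:
    """Return a copy of the task whose problem statement is replaced by a
    realistic conversational user query generated from the original issue."""
    query = llm(MUTATION_PROMPT.format(issue=task["problem_statement"]))
    mutated = dict(task)
    mutated["problem_statement"] = query.strip()
    return mutated

if __name__ == "__main__":
    # Toy stand-in for an LLM so the sketch runs without external services.
    fake_llm = lambda prompt: "Hey, sorting by two columns crashes for me, can you fix it?"
    example = {"instance_id": "demo__pkg-123",
               "problem_statement": "TypeError raised in sort_values when passing a list of keys ..."}
    print(mutate_task(example, fake_llm)["problem_statement"])
```

Presumably the mutated tasks keep the original repositories and test suites, so the same pass/fail harness can score agents on the vaguer, more realistic queries.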

Takeaways, Limitations

Takeaways:
We point out the gap between existing benchmarks and real-world usage and propose a new evaluation framework to address it.
By evaluating agents on realistic user queries, we show that existing benchmarks may overestimate their capabilities.
We present an evaluation paradigm that may have significant implications for the future development and assessment of conversational coding agents.
Limitations:
Because the user queries were generated from interaction data of a specific chat-based agent, generalizability to other agents and environments requires further study.
The scalability and applicability of the proposed framework need to be verified across a wider range of benchmarks and agents.
Beyond reporting the overall performance drop, deeper analysis, such as categorizing error types, is needed.