
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

Created by
  • Haebom

Author

Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

Outline

This paper points out the limitations of existing benchmarks in software engineering, especially the SWE-bench dataset, and proposes a new benchmark, SWE-MERA, to address them. The authors argue that SWE-bench suffers from serious data contamination (direct solution leakage and inadequate test cases) that undermines its reliability; SWE-MERA addresses this by automatically collecting real GitHub issues and applying rigorous quality verification. The benchmark currently provides about 10,000 potential tasks, with 300 samples available, and evaluation with the Aider coding agent shows clear performance differences among state-of-the-art LLMs. More than a dozen state-of-the-art models are evaluated on tasks collected between September 2024 and June 2025.
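As context for what an automatic issue-collection step might look like, below is a minimal Python sketch that harvests closed GitHub issues and applies a crude quality filter. This is not the authors' actual pipeline: the example repository (psf/requests), the minimum-length heuristic, and the helper names are assumptions for illustration, and the paper's rigorous quality verification (which this sketch omits) goes well beyond such surface checks.

```python
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for real use

def fetch_closed_issues(owner: str, repo: str, per_page: int = 50):
    """Fetch recently closed issues for a repository (candidate tasks)."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues"
    params = {"state": "closed", "per_page": per_page}
    resp = requests.get(url, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    # The issues endpoint also returns pull requests; keep only true issues.
    return [item for item in resp.json() if "pull_request" not in item]

def looks_usable(issue: dict, min_body_len: int = 200) -> bool:
    """Crude filter: the issue needs a substantive description and a close date.
    Real quality verification (e.g. checking that a fixing change makes failing
    tests pass) would require cloning the repo and running its test suite."""
    body = issue.get("body") or ""
    return len(body) >= min_body_len and issue.get("closed_at") is not None

if __name__ == "__main__":
    candidates = [i for i in fetch_closed_issues("psf", "requests") if looks_usable(i)]
    for issue in candidates[:5]:
        print(issue["number"], issue["title"])
```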

Takeaways, Limitations

Takeaways:
We reveal data contamination issues in the existing SWE-bench dataset and argue for the need for a new benchmark.
We propose SWE-MERA, a practical benchmark built from real GitHub issues, together with an automated data collection and quality verification pipeline.
We compare the performance of various state-of-the-art LLMs and demonstrate clear differentiation among models (a toy scoring sketch follows this list).
We contribute to the advancement of LLMs in software engineering through a continuously updated, dynamic benchmark.
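To make the model-comparison idea concrete, here is a toy aggregation sketch using the standard pass@k estimator over per-task attempt outcomes. The model names, attempt counts, and scores are entirely hypothetical and are not the paper's reported results; whether SWE-MERA reports pass@k specifically is an assumption of this illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of k
    sampled attempts resolves the task, given n attempts with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task outcomes: (n attempts, c successful resolutions).
results = {
    "model-a": [(6, 3), (6, 0), (6, 5)],
    "model-b": [(6, 1), (6, 0), (6, 2)],
}

for model, outcomes in results.items():
    score = sum(pass_at_k(n, c, k=1) for n, c in outcomes) / len(outcomes)
    print(f"{model}: pass@1 = {score:.2f}")
```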
Limitations:
The benchmark is limited in scale, with only 300 samples out of 10,000 potential tasks currently available.
The description of SWE-MERA's quality verification process may lack specific detail.
Evaluation results may depend on the specific coding agent used (Aider).
Since this dataset is based on GitHub issues, it may be biased towards certain types of software engineering problems.