SWE-bench and its variants suffer from limitations such as a narrow set of repositories, dependence on manual curation, poor scalability, and risks of overfitting and data contamination. This paper presents SWE-bench-Live, a benchmark that can be continuously updated, to overcome these limitations. SWE-bench-Live comprises 1,319 tasks drawn from GitHub issues created in 2024 or later across 93 repositories, and each task ships with a dedicated Docker image for reproducible execution. An automated curation pipeline handles the process from instance creation to environment setup, enabling scalability and continuous updates. Evaluations of modern LLMs and agent frameworks on SWE-bench-Live reveal performance differences relative to existing benchmarks, along with detailed analyses by repository source, issue recency, and task difficulty.
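To make the Docker-per-task setup concrete, the sketch below shows one way such instances could be executed for reproducible evaluation. The file name tasks.jsonl and the fields instance_id, image_name, and test_cmd are hypothetical placeholders for illustration, not the benchmark's actual schema.

```python
import json
import subprocess
from pathlib import Path

# Minimal sketch of consuming Docker-per-task benchmark instances.
# The file "tasks.jsonl" and the fields "instance_id", "image_name",
# and "test_cmd" are assumptions for illustration only; consult the
# SWE-bench-Live release for the actual instance schema.

def run_task(task: dict) -> bool:
    """Run a task's test command inside its dedicated Docker image."""
    result = subprocess.run(
        ["docker", "run", "--rm", task["image_name"], "bash", "-lc", task["test_cmd"]],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def main() -> None:
    lines = Path("tasks.jsonl").read_text().splitlines()
    tasks = [json.loads(line) for line in lines if line.strip()]
    for task in tasks:
        status = "PASS" if run_task(task) else "FAIL"
        print(f"{task['instance_id']}: {status}")

if __name__ == "__main__":
    main()
```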
Takeaways, Limitations
• Takeaways:
◦ Presents SWE-bench-Live, a benchmark that can be continuously updated, to overcome the limitations of existing benchmarks (dependence on manual curation, a limited set of repositories, and risk of data contamination).
◦ The automated curation pipeline enables scalability and continuous updates.
◦ Reflects a realistic software development environment by covering diverse repositories and recent issues.
◦ Evaluates and analyzes modern LLMs and agent frameworks, identifying performance differences among them.
• Limitations:
◦ The benchmark currently covers only GitHub issues created in 2024 or later (a temporal constraint on the data).
◦ The paper may lack a detailed description of the curation pipeline's specific algorithms and their performance.
◦ The level of support for various programming languages and development environments is not explicitly stated.