Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Created by
  • Haebom

Author

Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

Outline

This paper highlights a flaw in SWE-Bench Verified, a benchmark for evaluating the software engineering capabilities of large language models (LLMs). While recent LLMs achieve high scores on SWE-Bench, this may reflect memorization or data contamination rather than genuine problem-solving ability. To test this, the paper presents two diagnostic tasks: identifying file paths from issue descriptions alone, and reproducing functions given only the current file context and the issue description. Experimental results show that state-of-the-art models are highly accurate on data included in SWE-Bench, but their accuracy drops sharply on data not included in it, raising concerns about the reliability of SWE-Bench's evaluation results. The findings point to the need for more robust, contamination-resistant benchmarks for evaluating LLM coding capabilities.
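
The file-path diagnostic described above can be illustrated with a minimal sketch. It assumes exact-match scoring of a predicted path against the true path, which is an assumption made for illustration and not necessarily the paper's metric. The function name `path_accuracy`, the group labels, and every record below are hypothetical, and the toy data are made up, not results from the paper.

```python
from collections import defaultdict


def path_accuracy(records):
    """Return exact-match accuracy per group.

    Each record is a (group, predicted_path, true_path) tuple.
    Exact string match is an assumed scoring rule for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted.strip() == actual.strip():
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up records for illustration only (not data from the paper).
toy_records = [
    ("in_benchmark", "pkg/core.py", "pkg/core.py"),
    ("in_benchmark", "pkg/utils.py", "pkg/utils.py"),
    ("outside_benchmark", "lib/io.py", "lib/parse.py"),
    ("outside_benchmark", "lib/net.py", "lib/net.py"),
]

accuracy = path_accuracy(toy_records)
# A large positive gap would be consistent with memorization of benchmark data.
gap = accuracy["in_benchmark"] - accuracy["outside_benchmark"]
print(accuracy, gap)
```

Comparing the two groups side by side is the point of the sketch: a model that truly reasons from the issue text should score similarly on both, while a wide gap suggests the in-benchmark items were seen during training.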

Takeaways, Limitations

Takeaways: The study shows that existing benchmarks such as SWE-Bench Verified may not accurately assess LLMs' real-world problem-solving abilities. A more robust benchmark that guards against memorization and contamination is needed to evaluate LLM performance, along with an assessment methodology that distinguishes generalized problem-solving ability from memorization.
Limitations: The two diagnostic tasks presented may only assess specific types of problem-solving skills. A more comprehensive benchmark that encompasses a wider range of software engineering tasks is needed. The results may not generalize due to the characteristics of the dataset used in this study.