Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

Created by
  • Haebom

Author

Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Aaditya Bhatia, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Outline

SPICE is a scalable, automated pipeline for generating the high-quality labeled datasets needed to train and evaluate foundation models in software engineering. It automatically annotates SWE-bench-style datasets with issue clarity, test coverage, and effort estimation. By combining context-aware code exploration, evidence-based prompting, and multi-pass consensus, it produces labels that closely match expert annotations. SPICE is built on experience from labeling over 800 SWE-Gym instances and achieves high agreement with human-labeled SWE-bench Verified data. It dramatically reduces labeling costs: annotating 1,000 instances drops from approximately $100,000 for manual annotation to $5.10. The authors also release SPICE Bench, a new dataset of 6,802 SPICE-labeled instances from 291 open-source projects in SWE-Gym.
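The multi-pass consensus step described above can be illustrated with a minimal sketch: run the labeling prompt several times per instance and keep the majority label along with its agreement ratio. The function name, vote format, and 0–3 clarity scale below are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def consensus_label(passes):
    """Majority vote over labels from multiple independent labeling passes.

    passes: list of labels (e.g., issue-clarity scores) from repeated runs.
    Returns (winning_label, agreement_ratio).
    """
    counts = Counter(passes)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(passes)

# Hypothetical per-pass outputs for one SWE-bench-style instance,
# rating issue clarity on an assumed 0-3 scale:
clarity_votes = [1, 1, 2, 1, 1]
label, agreement = consensus_label(clarity_votes)
# label == 1, agreement == 0.8
```

In practice, a low agreement ratio could be used to flag instances for manual review rather than accepting the majority label outright.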

Takeaways, Limitations

Takeaways:
  • Significantly reduces the cost of building large, high-quality datasets for training software engineering foundation models.
  • Contributes to the research community through the SPICE tool and the SPICE Bench dataset, which is more than 13 times larger than SWE-bench Verified.
  • Achieves expert-level accuracy with an automated labeling pipeline.
Limitations:
  • SPICE's performance may vary depending on the characteristics of the codebase being labeled.
  • The supported annotation types are currently limited to issue clarity, test coverage, and effort estimation.
  • The system is not fully automated; some manual verification or adjustment may still be required, for example around the multi-pass consensus process.