Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Created by
  • Haebom

Author

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

Outline

This study evaluates large language models (LLMs) on generating code from algorithm descriptions in recent NLP papers. The task requires two core competencies: algorithm comprehension (synthesizing information from the paper and related academic literature to understand the implementation logic) and coding expertise (identifying dependencies and correctly implementing the required APIs). To enable rigorous evaluation, we present SciReplicate-Bench, a benchmark of 100 tasks drawn from 36 NLP papers published in 2024, with detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent, which interprets algorithmic concepts from the literature, and a Code Agent, which retrieves dependencies from the repository and implements the solution. To evaluate algorithm comprehension, we introduce reasoning graph accuracy, which quantifies the similarity between a model's generated reasoning graph and a reference reasoning graph derived from code annotations and structure. To assess implementation quality, we use execution accuracy, CodeBLEU, and repository dependency/API recall. In our experiments, we evaluate a range of strong non-reasoning and reasoning LLMs as baselines; even the best-performing LLM with Sci-Reproducer achieves an execution accuracy of only 39%, highlighting the benchmark's difficulty. Our analysis shows that missing or inconsistent algorithm descriptions are a major barrier to successful reproduction. The benchmark and code are available at https://github.com/xyzCS/SciReplicate-Bench , and the project homepage is available at https://xyzcs.github.io/scireplicate.github.io/ .
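As a rough illustration of how an execution-accuracy style metric could be computed, the sketch below treats each benchmark task as a piece of generated code plus a set of test scripts and reports the fraction of tasks whose code passes all of its tests. The task layout, the `python test.py generated.py` invocation, and the 300-second timeout are assumptions for illustration, not the actual SciReplicate-Bench evaluation harness.

```python
import subprocess

def run_tests(generated_file: str, test_files: list[str]) -> bool:
    """Return True only if the generated code passes every test case."""
    for test in test_files:
        try:
            # Hypothetical convention: each test script takes the generated file as an argument.
            result = subprocess.run(
                ["python", test, generated_file],
                capture_output=True,
                timeout=300,  # assumed per-test time limit
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
    return True

def execution_accuracy(tasks: list[dict]) -> float:
    """Fraction of tasks whose generated code passes all of its test cases."""
    passed = sum(run_tests(t["generated_file"], t["test_files"]) for t in tasks)
    return passed / len(tasks) if tasks else 0.0
```

Under this reading, a score of 39% means roughly 39 of the 100 benchmark tasks produced code that ran and passed every associated test.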

Takeaways, Limitations

Takeaways:
Provides a rigorous benchmark (SciReplicate-Bench) and evaluation criteria for LLMs' algorithm comprehension and code generation abilities.
Introduces new metrics for evaluating algorithm comprehension and implementation quality (reasoning graph accuracy, execution accuracy, CodeBLEU, repository dependency/API recall); a rough sketch of how such a graph-based score could be computed follows this list.
Clearly highlights the limits of current LLMs at algorithmic reproduction (only 39% execution accuracy for the best-performing model).
Shows that the quality of the algorithm description has a significant impact on the success of code generation.
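As a sketch of how a reasoning-graph comparison could be scored, the example below represents each graph as a set of step nodes and directed dependency edges and averages node and edge overlap against the reference graph. The representation, the overlap score, and the step names in the usage example are illustrative assumptions, not the paper's exact definition of reasoning graph accuracy.

```python
def graph_similarity(
    gen_nodes: set[str], gen_edges: set[tuple[str, str]],
    ref_nodes: set[str], ref_edges: set[tuple[str, str]],
) -> float:
    """Average of node overlap and edge overlap with the reference graph."""
    node_recall = len(gen_nodes & ref_nodes) / len(ref_nodes) if ref_nodes else 0.0
    edge_recall = len(gen_edges & ref_edges) / len(ref_edges) if ref_edges else 0.0
    return (node_recall + edge_recall) / 2

# Usage with made-up reasoning steps:
ref_nodes = {"embed tokens", "compute attention", "aggregate"}
gen_nodes = {"embed tokens", "compute attention"}
ref_edges = {("embed tokens", "compute attention"), ("compute attention", "aggregate")}
gen_edges = {("embed tokens", "compute attention")}
print(graph_similarity(gen_nodes, gen_edges, ref_nodes, ref_edges))  # ~0.583
```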
Limitations:
The benchmark covers only 36 papers and 100 tasks, which may limit how broadly the results generalize.
The evaluation metrics should be considered together; results may be biased toward specific metrics.
The set of evaluated LLMs may be limited; a wider range of models should be assessed.
External factors, such as incomplete algorithm descriptions in the source papers, are difficult to rule out entirely.