Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Created by
  • Haebom

Authors

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu

Outline

The explosion of scholarly literature is making the manual writing of scientific surveys increasingly infeasible. While large language models (LLMs) hold promise for automating this process, the lack of standardized benchmarks and evaluation protocols hinders progress in this field. To address this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each containing a topic description, an expert-written survey, and the full set of cited references, and (2) a large-scale academic corpus of over one million papers. We also propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Evaluations of various LLM-based methodologies reveal significant performance gaps, demonstrating that even advanced agent frameworks struggle with the complexity of survey generation and highlighting the need for further research in this area. All code, data, and models are open-source at https://github.com/oneal2000/SurGE.
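To make the evaluation setup more concrete, the sketch below illustrates how one of the four dimensions, citation accuracy, could be scored against a test instance's gold reference set. This is a minimal illustration only; the data fields, class, and scoring function are assumptions for this example and are not taken from the SurGE repository.

```python
# Hypothetical sketch (not the actual SurGE API): a benchmark instance with a
# topic, an expert-written survey, and gold references, scored on a simple
# citation-accuracy-style metric. All names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class SurveyInstance:
    topic: str                      # topic description
    expert_survey: str              # expert-written reference survey
    gold_references: set[str] = field(default_factory=set)  # IDs of cited papers


def citation_scores(generated_refs: set[str], gold_refs: set[str]) -> dict[str, float]:
    """Precision/recall of the references cited by a generated survey
    against the expert survey's gold reference set."""
    if not generated_refs or not gold_refs:
        return {"precision": 0.0, "recall": 0.0}
    hits = generated_refs & gold_refs
    return {
        "precision": len(hits) / len(generated_refs),
        "recall": len(hits) / len(gold_refs),
    }


# Example usage with toy paper IDs
instance = SurveyInstance(
    topic="Retrieval-augmented generation",
    expert_survey="...",
    gold_references={"paper_001", "paper_002", "paper_003"},
)
print(citation_scores({"paper_001", "paper_004"}, instance.gold_references))
# {'precision': 0.5, 'recall': 0.3333...}
```

The other three dimensions (comprehensiveness, structural organization, content quality) would require richer, LLM- or rubric-based judgments rather than simple set overlap, which is presumably why the paper proposes an automated evaluation framework.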

Takeaways, Limitations

Takeaways:
We present SurGE, a new benchmark for scientific survey generation in computer science, providing a foundation for objective evaluation in this area.
We propose an automated evaluation framework that systematically measures the quality of generated surveys.
By evaluating various LLM-based methodologies, we reveal the limitations of existing techniques and suggest directions for future research.
By releasing all code, data, and models as open source, we help foster and advance related research.
Limitations:
The SurGE benchmark is limited to computer science and may not generalize to other fields.
The automated evaluation framework is not yet perfect and may not fully replace human evaluation.
The evaluated methodologies still leave substantial room for performance improvement, and further research is needed.