Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Created by
  • Haebom

Authors

Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen

Outline

This paper proposes SGSimEval, a comprehensive benchmark for automatic survey generation (ASG) systems. The authors highlight the limitations of existing evaluation methods: biased metrics, no account of human preferences, and overreliance on LLM-based assessment. SGSimEval is a multifaceted evaluation framework that assesses the outline, content, and references of a generated survey, combining LLM-based scores with quantitative metrics. It also introduces human preference metrics that assess how similar a generated survey is to human-written ones. Experimental results show that current ASG systems reach human-level performance in outline generation but leave significant room for improvement in content and reference generation. The proposed evaluation metrics are also shown to be highly consistent with human evaluations.
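
The summary above does not say how LLM-based scores and similarity to human-written surveys are combined, so the following Python sketch is purely illustrative and should not be read as the paper's method. It assumes a hypothetical scheme in which each facet (outline, content, references) has an LLM-judged score on a 0 to 5 scale and a pair of embedding vectors, and the score is pulled toward the maximum in proportion to cosine similarity with the human-written survey. The names `cosine_similarity`, `blended_score`, and `max_score` are assumptions introduced here.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def blended_score(llm_score, similarity, max_score=5.0):
    """Hypothetical blend: weight the maximum score by similarity to the
    human-written survey, and the LLM-judged score by the remainder."""
    if not 0 <= llm_score <= max_score:
        raise ValueError("llm_score must lie in [0, max_score]")
    sim = min(max(similarity, 0.0), 1.0)  # clamp similarity to [0, 1]
    return sim * max_score + (1 - sim) * llm_score


# Illustrative, made-up inputs for a single facet (e.g. the outline).
generated_embedding = [1.0, 0.0]
human_embedding = [1.0, 0.0]
outline_score = blended_score(
    3.0, cosine_similarity(generated_embedding, human_embedding)
)
```

Under this assumed scheme, a generated facet identical in embedding to the human-written one receives the maximum score, while a facet with zero similarity keeps its raw LLM-judged score.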

Takeaways, Limitations

Takeaways:
SGSimEval, a new benchmark for evaluating automatic survey generation (ASG) systems, is presented.
It offers a multifaceted evaluation framework that combines LLM-based scores, quantitative metrics, and human preference metrics.
It clearly identifies the strengths and weaknesses of current ASG systems: outline generation is strong, while content and reference generation need improvement.
Its evaluation metrics show high consistency with human evaluations.
Limitations:
Further research is needed to determine how well SGSimEval generalizes.
More diverse types of surveys need to be evaluated.
The subjectivity of human preference assessment and the limited sample size should be taken into account.