Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Created by
  • Haebom

Author

Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin

Outline

This paper proposes DeepScholar-bench, a novel benchmark for evaluating generative research synthesis systems. Existing question-answering benchmarks focus on short, factual responses, and their expert-curated datasets are often outdated or prone to data contamination, failing to capture the complexity and evolving nature of real-world research synthesis. DeepScholar-bench instead targets a live, realistic task: extracting queries from recent, high-quality arXiv articles and generating the corresponding related-work sections, which requires retrieving, synthesizing, and citing relevant research. The evaluation framework assesses three key aspects: knowledge synthesis, retrieval quality, and verifiability. The authors also develop DeepScholar-base, an efficient reference pipeline implemented with the LOTUS API, and use the DeepScholar-bench framework to systematically evaluate existing open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. DeepScholar-base establishes a robust baseline, achieving performance competitive with or better than the other systems. Nevertheless, no system exceeds 19% on any metric, showing that DeepScholar-bench is far from saturated.
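The retrieve → synthesize → cite loop and the verifiability check described above can be illustrated with a toy sketch. This is not the paper's implementation or the LOTUS API; all function names, the lexical-overlap retriever, and the citation-coverage metric are simplified stand-ins for the real pipeline and metrics.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str

def retrieve(query: str, corpus: list[Doc], k: int = 2) -> list[Doc]:
    # Toy lexical retrieval: rank documents by word overlap with the query.
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.text.lower().split())))
    return scored[:k]

def synthesize(docs: list[Doc]) -> str:
    # Toy synthesis: emit one cited sentence per retrieved document.
    return " ".join(f"{d.text} [{d.doc_id}]." for d in docs)

def verifiability(report: str, retrieved_ids: list[str]) -> float:
    # Toy verifiability score: fraction of sentences citing a retrieved doc.
    sents = [s for s in report.split(".") if s.strip()]
    cited = sum(any(f"[{i}]" in s for i in retrieved_ids) for s in sents)
    return cited / len(sents) if sents else 0.0
```

Usage sketch: run `retrieve` over a small corpus, feed the hits to `synthesize`, then score the cited report with `verifiability`; a report whose every sentence carries a citation scores 1.0.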

Takeaways, Limitations

Takeaways:
  • Presents DeepScholar-bench, a new live benchmark for evaluating generative research synthesis systems.
  • Its design reflects real research tasks, enabling realistic evaluation.
  • Provides DeepScholar-base, a strong reference pipeline.
  • Establishes important criteria for progress in generative research synthesis.
  • Open-source code release makes the benchmark easy to extend and reproduce.
Limitations:
  • Scores remain low (no system exceeds 19% on any metric), leaving significant room for improvement.
  • The dataset is limited to arXiv papers, so generalizability to other domains needs further study.
  • Although the evaluation metrics are comprehensive, other aspects of synthesis quality may still need additional evaluation.
  • Dependence on the LOTUS API may limit accessibility.