This paper proposes DeepScholar-bench, a novel benchmark for evaluating generative research synthesis systems. Existing question-answering benchmarks focus on short, factual responses, and their expert-curated datasets are often outdated or prone to data contamination, failing to capture the complexity and evolving nature of real-world research synthesis tasks. DeepScholar-bench instead draws queries from recent, high-quality arXiv articles and targets a real-world research synthesis task: generating the related-work section of a paper, which requires retrieving, synthesizing, and citing relevant prior research. The evaluation framework comprehensively assesses three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, an efficient reference pipeline implemented with the LOTUS API, and use the DeepScholar-bench framework to systematically evaluate existing open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, achieving performance competitive with or better than each of the other systems. Moreover, DeepScholar-bench remains far from saturated: no system exceeds a score of $19$ on any metric.