This paper proposes SGSimEval, a comprehensive benchmark for automatic survey generation (ASG) systems. We highlight the limitations of existing evaluation methods, including biased metrics, the neglect of human preferences, and overreliance on LLM-based assessments. To address these limitations, we propose a multifaceted evaluation framework that integrates assessments of outlines, content, and references, combining LLM-based scores with quantitative metrics, and we further introduce a human preference metric that assesses similarity to human-written surveys. Experimental results demonstrate that current ASG systems achieve human-level performance in outline generation but leave significant room for improvement in content and reference generation. We also confirm that the proposed evaluation metrics maintain high consistency with human evaluations.