AbGen is the first benchmark designed to evaluate the ability of LLMs to design ablation studies for scientific research. It consists of 1,500 expert-annotated examples drawn from 807 NLP papers, and tasks LLMs with generating detailed ablation study designs for a specified module or process in a given research context. Evaluation of leading LLMs such as DeepSeek-R1-0528 and o4-mini reveals a significant performance gap between these models and human experts in terms of the importance, fidelity, and soundness of the ablation study designs. Furthermore, current automated evaluation methods diverge significantly from human judgments, suggesting that they are unreliable for this task. To investigate this further, we developed AbGen-Eval, a meta-evaluation benchmark for assessing the reliability of commonly used automated evaluation systems in measuring LLM performance on this task. On AbGen-Eval, we examine a variety of LLM-as-Judge systems, providing insights for developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.