With the increasing use of large language models (LLMs) for data generation, interest in having LLMs generate evaluation benchmarks has also grown. This paper examines, through two case studies, whether LLMs can meet the requirements for generating inference-focused text benchmarks. Specifically, we create LLM-generated versions of two high-quality crowdsourced reading comprehension datasets: CondaQA, which targets inference over negation, and DROP, which targets quantitative inference. Comparing these with the original datasets, we find that LLMs can generate valid versions at low cost that follow the original annotation guidelines, but that the generated benchmarks are less challenging than the human-authored ones.