This paper addresses the lack of scalable and reliable evaluation data for improving the planning and reasoning capabilities of large language models (LLMs). To this end, we choose automated workflow generation as a suitable domain and present NL2Flow, a fully automated system for generating planning problems expressed in natural language, a structured intermediate representation, and formal PDDL. Using NL2Flow, we generate a dataset of 2,296 low-difficulty problems and evaluate several open-source, instruction-tuned LLMs without task-specific optimization or architectural modification. The best-performing model achieves a success rate of 86% in generating valid plans and 69% in generating optimal plans for problems that admit feasible solutions. Regression analysis shows that the influence of problem characteristics on performance varies with the model and the prompt design. Furthermore, we investigate the potential of LLMs as natural-language-to-JSON translators for workflow definitions and evaluate their translation performance on natural language workflow descriptions, with the aim of enabling integration with downstream symbolic computation tools and symbolic planners. Translating natural language into a JSON representation of the workflow problem yielded lower success rates than generating a plan directly, suggesting that unnecessary decomposition of the reasoning task can degrade performance and highlighting the advantage of models capable of reasoning directly from natural language to actions. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.