To address the lack of scalable evaluation data for improving the planning and reasoning capabilities of large language models (LLMs), this paper presents NL2Flow, a pipeline for automatically generating and evaluating workflow planning problems. NL2Flow parameterizes each problem in a structured intermediate representation, which is then translated into natural language and into formal PDDL. Using a dataset of 2,296 low-difficulty problems, we evaluate several open-source, instruction-tuned LLMs. The best-performing model achieves a success rate of 86% for generating valid plans and 69% for generating optimal plans (among solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation depends on both the model and the prompt design. Notably, translating the natural-language problem into a structured JSON representation before symbolic planning significantly improves the success rate, suggesting the benefit of neuro-symbolic integration. As LLM reasoning scales to more complex tasks, it is crucial to understand the sources of error within the system.
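To make the pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the core idea described above: a parameterized workflow-planning problem held in a structured intermediate representation, emitted both as JSON (the form handed to the LLM) and as a PDDL stub (the form handed to a symbolic planner). All names, fields, and the PDDL domain `workflow` are illustrative assumptions, not details taken from the paper.

```python
import json

# Hypothetical intermediate representation of one generated problem:
# actions with required/produced variables, plus a goal set.
problem = {
    "name": "wf-demo",
    "actions": [
        {"name": "fetch_data", "requires": [], "produces": ["raw_data"]},
        {"name": "clean_data", "requires": ["raw_data"], "produces": ["clean_data"]},
    ],
    "goal": ["clean_data"],
}

def to_pddl(p: dict) -> str:
    """Render the intermediate representation as a PDDL problem stub
    (illustrative only; the real pipeline's encoding may differ)."""
    objects = sorted({v for a in p["actions"] for v in a["requires"] + a["produces"]})
    goal = " ".join(f"(available {g})" for g in p["goal"])
    return (
        f"(define (problem {p['name']}) (:domain workflow)\n"
        f"  (:objects {' '.join(objects)} - variable)\n"
        f"  (:goal (and {goal})))\n"
    )

print(json.dumps(problem, indent=2))  # structured JSON form for the LLM
print(to_pddl(problem))               # formal PDDL form for the symbolic planner
```

Keeping one intermediate representation and deriving both the natural-language and PDDL views from it is what makes the reported neuro-symbolic comparison possible: the same problem instance can be solved by direct LLM planning or by LLM-to-JSON translation followed by symbolic planning.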