Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Created by
  • Haebom

Author

Jungkoo Kang

Outline

To address the lack of scalable evaluation data for improving the planning and reasoning capabilities of large language models (LLMs), this paper presents NL2Flow, a pipeline for automatically generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation and translates them into both natural language and formal PDDL. Using a dataset of 2,296 low-difficulty problems, the authors evaluate several open-source, instruction-tuned LLMs. The best-performing model achieves an 86% success rate for generating valid plans and 69% for generating optimal plans (among solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation depends on both the model and the prompt design. Notably, converting the natural-language problem into a structured JSON representation before symbolic planning significantly improves the success rate, suggesting the benefits of neural-symbolic integration. As LLM reasoning scales to more complex tasks, understanding where errors arise within such systems becomes crucial.
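The sketch below illustrates the general shape of such a pipeline: a parametric problem generator, a natural-language rendering for the LLM prompt, and a symbolic validity check on the returned plan. It is a minimal illustration, not the paper's implementation; every function and field name (make_problem, to_natural_language, symbolic_check, call_llm) is a hypothetical placeholder.

```python
# Minimal sketch of an NL2Flow-style evaluation loop (hypothetical names throughout).
import json
import random

def make_problem(num_actions: int, num_goals: int, seed: int) -> dict:
    """Sample a parametric workflow-planning problem as a structured dict."""
    random.seed(seed)
    actions = [f"step_{i}" for i in range(num_actions)]
    return {
        "actions": actions,
        # each action depends on the previous one (a simple chain, for illustration)
        "dependencies": {a: actions[i - 1] for i, a in enumerate(actions) if i > 0},
        "goals": random.sample(actions, k=min(num_goals, num_actions)),
    }

def to_natural_language(problem: dict) -> str:
    """Render the structured problem as a natural-language prompt for the LLM."""
    deps = "; ".join(f"{a} requires {b}" for a, b in problem["dependencies"].items())
    return (f"Available steps: {', '.join(problem['actions'])}. "
            f"Constraints: {deps}. Goal: complete {', '.join(problem['goals'])}. "
            f"Return the plan as a JSON list of step names.")

def symbolic_check(problem: dict, plan: list[str]) -> bool:
    """Symbolically validate a candidate plan: dependencies satisfied, goals reached."""
    done = set()
    for step in plan:
        dep = problem["dependencies"].get(step)
        if dep is not None and dep not in done:
            return False  # dependency violated
        done.add(step)
    return all(g in done for g in problem["goals"])

# Usage (call_llm stands in for any instruction-tuned model endpoint):
# problem = make_problem(num_actions=5, num_goals=2, seed=0)
# raw = call_llm(to_natural_language(problem))
# plan = json.loads(raw)              # structured JSON output from the model
# print("valid plan:", symbolic_check(problem, plan))
```

The point of the structured intermediate representation is that validity and optimality can be checked symbolically, so the LLM's output can be scored automatically at scale.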

Takeaways, Limitations

Takeaways:
NL2Flow provides a scalable pipeline for generating datasets to evaluate LLM planning and reasoning capabilities.
Transforming natural-language problems into structured representations improves LLM plan generation, supporting the utility of neural-symbolic integration.
The analysis identifies factors (model, prompt design, and problem characteristics) that affect LLM plan generation and suggests directions for future research.
Understanding and resolving the sources of error is emphasized as essential for improving LLM reasoning performance.
Limitations:
Only 2,296 low-difficulty problems have been evaluated so far, so LLM performance on harder problems requires further study.
The diversity and complexity of the problems generated by NL2Flow need further examination.
Evaluation across a broader range of LLMs is needed.