Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Created by
  • Haebom

Author

Jungkoo Kang

Outline

This paper addresses the shortage of scalable, reliable evaluation data that holds back progress on the planning and reasoning capabilities of large language models (LLMs). It focuses on one suitable domain, automated workflow generation, and presents NL2Flow, a fully automated system that generates planning problems parametrically and expresses them in natural language, a structured intermediate representation, and formal PDDL. NL2Flow is used to generate a dataset of 2,296 low-difficulty problems, on which several open-source, instruction-tuned LLMs are evaluated without task-specific optimization or architecture modification. The best-performing model achieves a success rate of 86% for generating valid plans and 69% for generating optimal plans on problems that have a feasible plan. Regression analysis shows that the impact of problem characteristics on plan generation varies with the model and the prompt design. The paper also investigates LLMs as natural language-to-JSON converters for workflow definitions, evaluating their translation performance on natural language workflow descriptions as a way to integrate with downstream symbolic computation tools and symbolic planners. Converting natural language into a JSON representation of the workflow problem yielded lower success rates than generating a plan directly, which suggests that unnecessarily decomposing the reasoning task can degrade performance and highlights the advantage of models that can reason directly from natural language to actions. As LLM reasoning scales to increasingly complex problems, understanding the evolving bottlenecks and sources of error in these systems is crucial.
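As a rough illustration of the idea summarized above, the sketch below builds a toy workflow problem in a structured intermediate representation, renders it as both natural language and PDDL, and checks whether a candidate plan is valid or optimal. It is not the paper's implementation: the function names (make_problem, to_natural_language, to_pddl, is_valid, optimal_length), the "needs/gives" action schema, and the simplified grounded PDDL encoding are all assumptions made for this example.

```python
from collections import deque


def make_problem(chain_length, shortcut=None):
    """Hypothetical parametric generator (not the paper's).

    Action a<i> needs item v<i-1> and gives v<i>; an optional "skip"
    action (src, dst) needs v<src> and gives v<dst>.
    """
    actions = {
        f"a{i}": {"needs": [f"v{i - 1}"], "gives": f"v{i}"}
        for i in range(1, chain_length + 1)
    }
    if shortcut is not None:
        src, dst = shortcut
        actions["skip"] = {"needs": [f"v{src}"], "gives": f"v{dst}"}
    return {"known": ["v0"], "goal": f"v{chain_length}", "actions": actions}


def to_natural_language(problem):
    """Render the intermediate representation as a plain-English prompt."""
    lines = [f"You already have: {', '.join(problem['known'])}."]
    for name, spec in problem["actions"].items():
        needs = ", ".join(spec["needs"])
        lines.append(f"Action {name} needs {needs} and gives {spec['gives']}.")
    lines.append(f"Find a sequence of actions that gives {problem['goal']}.")
    return "\n".join(lines)


def to_pddl(problem):
    """Render the same problem as a simplified, grounded STRIPS encoding."""
    items = set(problem["known"]) | {problem["goal"]}
    blocks = []
    for name, spec in problem["actions"].items():
        items |= set(spec["needs"]) | {spec["gives"]}
        pre = " ".join(f"(known {v})" for v in spec["needs"])
        blocks.append(
            f"  (:action {name} :parameters ()\n"
            f"    :precondition (and {pre})\n"
            f"    :effect (known {spec['gives']}))"
        )
    init = " ".join(f"(known {v})" for v in problem["known"])
    domain = (
        "(define (domain workflow)\n"
        "  (:requirements :strips)\n"
        f"  (:constants {' '.join(sorted(items))})\n"
        "  (:predicates (known ?x))\n" + "\n".join(blocks) + ")"
    )
    task = (
        "(define (problem workflow-task) (:domain workflow)\n"
        f"  (:init {init})\n"
        f"  (:goal (known {problem['goal']})))"
    )
    return domain, task


def is_valid(problem, plan):
    """A plan is valid if every step is applicable and the goal is reached."""
    known = set(problem["known"])
    for name in plan:
        spec = problem["actions"].get(name)
        if spec is None or not set(spec["needs"]) <= known:
            return False
        known.add(spec["gives"])
    return problem["goal"] in known


def optimal_length(problem):
    """Breadth-first search for the shortest plan length (None if unsolvable)."""
    start = frozenset(problem["known"])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        known, depth = queue.popleft()
        if problem["goal"] in known:
            return depth
        for spec in problem["actions"].values():
            if set(spec["needs"]) <= known:
                nxt = known | {spec["gives"]}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None
```

For example, with make_problem(3, shortcut=(0, 2)), the plan ["a1", "a2", "a3"] is valid but not optimal, because ["skip", "a3"] reaches the goal in two steps. That gap between a plan that merely works and the shortest one is the distinction behind the separate valid-plan and optimal-plan success rates reported above.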

Takeaways, Limitations

Takeaways:
Presents a new evaluation method and dataset (NL2Flow) for automated workflow generation with LLMs.
Reports an empirical analysis of LLM plan generation: the best-performing model produced valid plans for 86% of problems and optimal plans for 69% of problems that have a feasible plan.
Provides insight into how problem characteristics, models, and prompt design interact.
Suggests directions for improving LLM reasoning strategies by comparing the success rates of direct plan generation and natural language-to-JSON conversion.
Limitations:
Only low-difficulty problems are evaluated so far; further research is needed to determine LLM performance on more complex problems.
The study is limited to a single domain, automated workflow generation; its generalizability to other domains remains to be verified.
The evaluated LLMs are limited to open-source, instruction-tuned models; the latest large-scale models still need to be evaluated.