Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Plan Verification for LLM-Based Embodied Task Completion Agents

Created by
  • Haebom

Author

Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tur, Gokhan Tur

Outline

This paper addresses the problem that large language model (LLM)-generated task plans for embodied AI, as well as the corresponding human demonstrations, can contain unnecessary actions, redundant exploration, and logical errors that degrade policy quality. To address this, the authors propose an iterative verification framework in which a judge LLM critiques action sequences and a planner LLM applies the corrections. Unlike rule-based approaches, the method relies on natural language prompting, enabling broad generalization across diverse error types, including irrelevant actions, contradictions, and missing steps. On a manually annotated action set from the TEACh embodied AI dataset, the framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT-4-mini, DeepSeek-R1, Gemini 2.5, and LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Importantly, the method preserves human error-recovery patterns, supporting future research on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, this work provides a scalable path to high-quality training data for imitation learning in embodied AI.
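To make the judge-planner loop concrete, here is a minimal Python sketch of the iterative refinement described above. The `refine_plan` function, the three-iteration default budget, and the toy duplicate-action judge are illustrative assumptions, not the authors' implementation; in a real setup both roles would be prompted LLMs.

```python
from typing import Callable

Critique = list[str]

def refine_plan(
    actions: list[str],
    judge: Callable[[list[str]], Critique],
    planner: Callable[[list[str], Critique], list[str]],
    max_iters: int = 3,  # the paper reports 96.5% of sequences converge within 3 passes
) -> list[str]:
    """Iteratively critique and correct an action sequence.

    `judge` returns natural-language critiques (e.g., irrelevant actions,
    contradictions, missing steps); an empty list means the plan is accepted.
    `planner` applies the critiques and returns a corrected sequence.
    """
    for _ in range(max_iters):
        critiques = judge(actions)
        if not critiques:  # judge accepts the plan -> converged
            return actions
        actions = planner(actions, critiques)
    return actions

# Toy usage with stand-in callables (hypothetical, for illustration only):
if __name__ == "__main__":
    plan = ["goto kitchen", "goto kitchen", "pick up mug", "place mug in sink"]
    judge = lambda a: ["redundant action detected"] if len(a) != len(set(a)) else []
    planner = lambda a, c: list(dict.fromkeys(a))  # drop duplicates, keep order
    print(refine_plan(plan, judge, planner))
```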

Takeaways, Limitations

Takeaways:
Demonstrates that the quality of embodied AI task plans can be improved through an iterative, LLM-based plan verification framework.
The natural language prompting approach generalizes across diverse error types.
The framework improves temporal efficiency and spatial action organization.
Preserving human error-recovery patterns supports research on robust corrective behavior.
Presents a scalable method for generating high-quality training data for imitation learning.
Limitations:
Experimental results are currently limited to the TEACh dataset; further research is needed to establish generalizability to other datasets.
The approach depends on LLM performance, so the limitations of the underlying LLMs may affect results.
Generalization to more complex tasks and diverse situations remains to be verified.