Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Plan Verification for LLM-Based Embodied Task Completion Agents

Created by
  • Haebom

Author

Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tur, Gokhan Tur

Outline

This paper addresses the problem that LLM-generated task plans for embodied AI, as well as the corresponding human demonstrations, can degrade policy quality due to unnecessary actions, redundant exploration, and logical errors. To address this, the authors propose an iterative verification framework in which a judge LLM critiques action sequences and a planner LLM applies the corrections, producing progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, the framework relies on natural-language prompting, enabling broad generalization across diverse error types, including irrelevant actions, contradictions, and missing steps. On a manually annotated action set from the TEACh embodied AI dataset, the framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT-4-mini, DeepSeek-R1, Gemini 2.5, and LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, improving both temporal efficiency and the spatial organization of actions. Importantly, the method preserves human error-recovery patterns rather than disrupting them, supporting future research on robust correction behaviors. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, the work offers a scalable path toward high-quality training data for imitation learning in embodied AI.
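The judge-planner refinement loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `critique_fn` stands in for the judge LLM (returning natural-language issues) and `revise_fn` for the planner LLM (applying corrections), and both names plus the toy stubs below are hypothetical.

```python
from typing import Callable, List

def refine_plan(
    plan: List[str],
    critique_fn: Callable[[List[str]], List[str]],
    revise_fn: Callable[[List[str], List[str]], List[str]],
    max_iters: int = 3,
) -> List[str]:
    """Iteratively refine an action sequence until the judge reports no issues.

    critique_fn: judge LLM stand-in; returns a list of critiques (empty = clean).
    revise_fn:   planner LLM stand-in; applies the critiques to the plan.
    """
    for _ in range(max_iters):
        issues = critique_fn(plan)
        if not issues:  # converged: no irrelevant, contradictory, or missing steps
            break
        plan = revise_fn(plan, issues)
    return plan

# Toy stand-ins for the two LLM calls: flag and drop a duplicated action.
def toy_critique(plan: List[str]) -> List[str]:
    return ["remove duplicate 'open fridge'"] if plan.count("open fridge") > 1 else []

def toy_revise(plan: List[str], issues: List[str]) -> List[str]:
    out: List[str] = []
    for action in plan:
        if action == "open fridge" and "open fridge" in out:
            continue  # drop the redundant repetition flagged by the judge
        out.append(action)
    return out

cleaned = refine_plan(["open fridge", "open fridge", "take milk"],
                      toy_critique, toy_revise)
```

In the paper's setting both functions would be prompted LLM calls, and the `max_iters` cap reflects the reported finding that most sequences converge within three iterations.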

Takeaways, Limitations

Takeaways:
Demonstrates that the quality of embodied AI task plans can be improved through an iterative plan-verification framework using LLMs.
The natural-language-prompting approach generalizes across diverse error types.
Improves temporal efficiency and the spatial organization of actions.
Preserves human error-recovery patterns, contributing to robust correction behaviors.
Provides a scalable method for generating high-quality training data for imitation learning.
Limitations:
Performance may depend on the capability of the underlying LLM.
Evaluation is limited to the TEACh dataset; generalization to other datasets requires further validation.
Further research is needed on handling complex tasks and exceptional situations.
Complete error removal is not guaranteed; some errors may remain.