Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Plancraft: an evaluation dataset for planning with LLM agents

Created by
  • Haebom

Authors

Gautier Dagan, Frank Keller, Alex Lascarides

Outline

Plancraft is a multimodal evaluation dataset for LLM agents. It offers both a text-only and a multimodal interface, based on the Minecraft crafting GUI. The dataset includes the Minecraft Wiki, to evaluate tool use and Retrieval Augmented Generation (RAG), as well as a hand-crafted planner and an Oracle Retriever, to ablate the individual components of a modern agent architecture. To evaluate decision-making, it also includes a subset of intentionally unsolvable examples, so the agent must not only complete tasks but also decide whether a task is solvable at all. We benchmark open-source and closed-source LLMs and compare their performance and efficiency against the hand-crafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems in Plancraft, and we offer suggestions for improving their capabilities.
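Because the dataset mixes solvable and intentionally unsolvable examples, a single success rate does not capture the agent's behaviour. The sketch below illustrates one way such an evaluation could be scored. It is not Plancraft's actual API or metric definition: the `Episode` record, the outcome labels, and the `score` function are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluated task (hypothetical record, not a Plancraft type)."""
    solvable: bool  # ground truth: can the target be crafted at all?
    outcome: str    # assumed labels: "success", "gave_up", or "failed"


def _rate(hits: int, total: int) -> float:
    """Return hits / total, or 0.0 when there is nothing to count."""
    return hits / total if total else 0.0


def score(episodes: list[Episode]) -> dict[str, float]:
    """Score task completion and solvability judgment separately."""
    solvable = [e for e in episodes if e.solvable]
    unsolvable = [e for e in episodes if not e.solvable]
    return {
        # Share of solvable tasks the agent actually completed.
        "success_rate": _rate(
            sum(e.outcome == "success" for e in solvable), len(solvable)
        ),
        # Share of unsolvable tasks the agent correctly declared impossible.
        "unsolvable_detected": _rate(
            sum(e.outcome == "gave_up" for e in unsolvable), len(unsolvable)
        ),
        # Share of solvable tasks the agent wrongly abandoned.
        "false_give_up": _rate(
            sum(e.outcome == "gave_up" for e in solvable), len(solvable)
        ),
    }


if __name__ == "__main__":
    demo = [
        Episode(solvable=True, outcome="success"),
        Episode(solvable=True, outcome="gave_up"),
        Episode(solvable=False, outcome="gave_up"),
        Episode(solvable=False, outcome="failed"),
    ]
    print(score(demo))
```

Reporting the three numbers separately makes the trade-off visible: an agent that always gives up would detect every unsolvable task but complete none of the solvable ones.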

Takeaways, Limitations

Takeaways: Plancraft is a new benchmark for evaluating the planning and decision-making abilities of LLMs and VLMs on realistic problems. Evaluating RAG over the Minecraft Wiki and comparing agents against a hand-crafted planner points to directions for improving LLM agent architectures. Including unsolvable problems makes it possible to evaluate an agent's judgment as well as its problem-solving ability.
Limitations: The benchmark is confined to a single game environment, so further research is needed on how well its Minecraft tasks and results generalize to other domains. The evaluation dataset may also be limited in size and diversity.