Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution

Created by
  • Haebom

Authors

Alexandru Coca, Mark Gaynor, Zhenxing Zhang, Jianpeng Cheng, Bo-Hsiang Tseng, Pete Boothroyd, Héctor Martínez Alonso, Diarmuid Ó Séaghdha, Anders Johannsen

Outline

This paper evaluates the feasibility of building a digital assistant capable of executing complex tasks using large language models (LLMs). Such an assistant draws on the programming knowledge acquired during pre-training to generate a task execution program that accomplishes a multi-step goal by composing the objects and functions defined in an assistant library. To this end, the authors develop ASPERA, a framework consisting of an assistant library simulation and a human-assisted LLM data generation engine. The engine guides developers in generating high-quality tasks, each consisting of a complex user query, a simulation state, and a corresponding validation program, thereby addressing data availability and evaluation robustness issues. The authors also release Asper-Bench, an evaluation dataset of 250 challenging tasks generated with ASPERA, and show that generating programs grounded in a custom assistant library is significantly more challenging for LLMs than dependency-free code generation.
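To make the task structure described above more concrete, here is a minimal sketch of a user query, a simulation state, a task execution program, and a validation program. Everything in it is an assumption made for illustration: the `Calendar`, `Event`, `find_team`, `make_state`, `generated_program`, and `validate` names are hypothetical and are not taken from the ASPERA library or from Asper-Bench.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


# --- Hypothetical toy "assistant library" (not ASPERA's actual API) ---
@dataclass
class Event:
    title: str
    start: datetime
    attendees: list


@dataclass
class Calendar:
    events: list = field(default_factory=list)

    def add_event(self, title, start, attendees):
        """Create an event and store it in the simulated calendar."""
        event = Event(title, start, list(attendees))
        self.events.append(event)
        return event


def find_team(org, manager):
    """Return the direct reports of `manager` from a toy org chart."""
    return list(org.get(manager, []))


# --- Simulation state: the world the program runs against (illustrative) ---
def make_state():
    return Calendar(), {"alice": ["bob", "carol"]}


# --- User query and the kind of program an LLM would be asked to write ---
QUERY = "Schedule a sync with my team tomorrow at 10am."


def generated_program(calendar, org, user, now):
    """Multi-step plan: resolve the team, compute the time, book the event."""
    start = (now + timedelta(days=1)).replace(
        hour=10, minute=0, second=0, microsecond=0
    )
    calendar.add_event("Team sync", start, find_team(org, user))


# --- Validation program: inspects the resulting state, not the code text ---
def validate(calendar, org, user, now):
    expected_start = (now + timedelta(days=1)).replace(
        hour=10, minute=0, second=0, microsecond=0
    )
    if len(calendar.events) != 1:
        return False
    event = calendar.events[0]
    return (
        event.start == expected_start
        and sorted(event.attendees) == sorted(org[user])
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 6, 9, 30)  # fixed clock keeps the run repeatable
    calendar, org = make_state()
    generated_program(calendar, org, "alice", now)
    print(validate(calendar, org, "alice", now))
```

In this sketch the validation step checks the state left behind by the program rather than comparing program text, so any program that books the right event would pass. Whether ASPERA's validation programs work exactly this way is not stated in the summary above.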

Takeaways, Limitations

Takeaways:
Examines the feasibility of developing a digital assistant that executes complex tasks using LLMs.
Provides the ASPERA framework for generating high-quality task data, along with the Asper-Bench evaluation dataset.
Shows that generating programs grounded in a custom assistant library is challenging for LLMs.
Limitations:
Further research is needed on the performance and generalizability of the ASPERA framework.
The size and diversity of the Asper-Bench dataset need to be expanded.
Additional research is needed on the problems that may arise when applying the approach in real environments, and on their solutions.