Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Created by
  • Haebom

Authors

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong

Outline

In this paper, we present APIGen-MT, a novel framework for generating high-quality data to train AI agents for multi-turn interactions. APIGen-MT operates in two stages: first, an agentic pipeline generates detailed task blueprints with ground-truth actions, verified by an LLM reviewer through an iterative feedback loop; second, the blueprints are transformed into complete interaction trajectories via simulated human-agent interplay. A family of xLAM-2-fc-r models (1 billion to 70 billion parameters) trained on this data outperforms state-of-the-art models such as GPT-4o and Claude 3.5 on the $\tau$-bench and BFCL benchmarks, with the smaller models surpassing larger ones especially in multi-turn settings. We contribute to the advancement of AI agent research by open-sourcing 5,000 synthetic interaction trajectories and the trained xLAM-2-fc-r models.
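To make the two-stage design concrete, here is a minimal Python sketch of the propose-review-simulate loop, under stated assumptions: all names (propose_blueprint, review_blueprint, simulate_dialogue) and the stub bodies are illustrative stand-ins, not the paper's actual implementation. In APIGen-MT, the stubs would be LLM calls and an executable API environment.

```python
# Hypothetical sketch of APIGen-MT's two-stage generation loop.
# Stage 1: draft a task blueprint, gated by an LLM reviewer with feedback.
# Stage 2: simulate human-agent interplay and keep only verified trajectories.
# All function names and data shapes below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Blueprint:
    """A proposed task: user intent plus ground-truth API calls (assumed shape)."""
    intent: str
    groundtruth_actions: list[str]


def propose_blueprint(feedback: list[str]) -> Blueprint:
    # Stand-in for an LLM drafting a task, conditioned on prior critiques.
    return Blueprint(
        intent="Cancel my most recent order and refund it",
        groundtruth_actions=["get_orders(user_id)", "cancel_order(order_id)"],
    )


def review_blueprint(bp: Blueprint) -> tuple[bool, str]:
    # Stand-in for the LLM reviewer: approve iff the task has executable steps.
    if bp.groundtruth_actions:
        return True, ""
    return False, "Blueprint has no executable actions; add concrete API calls."


def simulate_dialogue(bp: Blueprint) -> list[dict]:
    # Stand-in for simulated human-agent interplay producing a multi-turn
    # trajectory of messages and tool calls.
    trajectory = [{"role": "user", "content": bp.intent}]
    for action in bp.groundtruth_actions:
        trajectory.append({"role": "assistant", "tool_call": action})
        trajectory.append({"role": "tool", "content": "ok"})
    trajectory.append({"role": "assistant", "content": "Done, order refunded."})
    return trajectory


def executed_calls(trajectory: list[dict]) -> list[str]:
    # Extract the API calls the agent actually made during the dialogue.
    return [m["tool_call"] for m in trajectory if "tool_call" in m]


def generate_trajectory(max_rounds: int = 3) -> list[dict] | None:
    feedback: list[str] = []
    for _ in range(max_rounds):
        bp = propose_blueprint(feedback)       # Stage 1: draft a blueprint
        ok, critique = review_blueprint(bp)    # Stage 1: reviewer gate
        if not ok:
            feedback.append(critique)          # iterate with reviewer feedback
            continue
        traj = simulate_dialogue(bp)           # Stage 2: simulated interplay
        if executed_calls(traj) == bp.groundtruth_actions:
            return traj                        # keep only verified trajectories
    return None                                # discard after repeated failure


if __name__ == "__main__":
    print(generate_trajectory())
```

The design point the sketch preserves is that a trajectory is kept only when the executed calls match the blueprint's ground-truth actions, so failures at either stage trigger regeneration rather than leaking noisy examples into the training data.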

Takeaways, Limitations

Takeaways:
Presents an effective framework (APIGen-MT) for generating high-quality multi-turn interaction data.
Develops the xLAM-2-fc-r model series, which outperforms existing state-of-the-art models.
Demonstrates strong performance of small models in multi-turn settings.
Contributes to research progress by open-sourcing 5,000 synthetic trajectories and the trained models.
Limitations:
The gap between simulated data and real-world data has not been clearly validated.
Heavy reliance on the LLM reviewer means reviewer bias may influence the generated data.
Benchmark-based evaluation may not fully reflect performance in real application environments.