
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Created by
  • Haebom

Authors

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan

Outline

Existing conversational AI benchmarks simulate single-control environments in which only the AI agent interacts with the environment through tools, while the user remains a passive information provider. This differs from real-world scenarios such as technical support, where the user must actively modify the state of a shared environment. To address this gap, this paper presents $\tau^2$-bench, which models a telecom dual-control domain as a Dec-POMDP in which both the agent and the user act on a shared, dynamic environment using tools. $\tau^2$-bench tests both agent coordination and communication, and enables fine-grained analysis of agent performance through a configurable task generator, a reliable user simulator tightly coupled to the environment, and ablation experiments that separate reasoning errors from communication/coordination errors. Experimental results show that agent performance degrades significantly when moving from a no-user setting to dual-control, highlighting the difficulty of guiding user actions. $\tau^2$-bench thus provides a controlled testbed for agents that must both reason effectively and guide users' actions.
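The dual-control idea is easiest to see in code. Below is a minimal, hypothetical sketch in the spirit of the telecom domain: a shared state that the agent can only partially fix on its own, so the task succeeds only if the user also acts. All names here (SharedEnv, reactivate_line, toggle_airplane_mode, task_solved) are illustrative assumptions, not the actual $\tau^2$-bench API.

```python
# Minimal sketch of a dual-control environment (hypothetical names,
# not the actual tau^2-bench API).
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class SharedEnv:
    """Shared dynamic state that both the agent and the user can modify."""
    state: Dict[str, str] = field(default_factory=lambda: {
        "airplane_mode": "on",       # device-side setting only the user can change
        "line_status": "suspended",  # backend setting only the agent can change
    })

def make_tools(env: SharedEnv):
    # Agent-side tool: acts on the backend portion of the shared state.
    def reactivate_line() -> str:
        env.state["line_status"] = "active"
        return "line reactivated"

    # User-side tool: the agent cannot call this directly; it must
    # *instruct* the (simulated) user to do so.
    def toggle_airplane_mode() -> str:
        cur = env.state["airplane_mode"]
        env.state["airplane_mode"] = "off" if cur == "on" else "on"
        return f"airplane mode {env.state['airplane_mode']}"

    agent_tools: Dict[str, Callable[[], str]] = {"reactivate_line": reactivate_line}
    user_tools: Dict[str, Callable[[], str]] = {"toggle_airplane_mode": toggle_airplane_mode}
    return agent_tools, user_tools

def task_solved(env: SharedEnv) -> bool:
    # Success requires coordinated actions from *both* parties.
    return env.state == {"airplane_mode": "off", "line_status": "active"}

if __name__ == "__main__":
    env = SharedEnv()
    agent_tools, user_tools = make_tools(env)
    # In tau^2-bench the agent would have to talk the simulated user into
    # taking their half of the fix; here we simply call both halves.
    print(agent_tools["reactivate_line"]())
    print(user_tools["toggle_airplane_mode"]())
    print("solved:", task_solved(env))
```

Because neither party's tools alone reach the goal state, the benchmark can probe communication and coordination, not just tool use.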

Takeaways, Limitations

Takeaways:
Presents $\tau^2$-bench, a new benchmark for dual-control environments that overcomes the limitations of existing single-control setups by accounting for active user participation.
Evaluates not only the agent's reasoning ability but also its ability to communicate and cooperate with users effectively.
Generates diverse, verifiable tasks with controllable complexity through a configurable task generator.
Achieves high fidelity to real interactions through a user simulator tightly coupled to the environment.
Offers clear directions for improving agent performance by separating and analyzing reasoning errors and communication/coordination errors (one way to read that separation is sketched after this list).
Limitations:
Specialized to the telecom domain; further research is needed on generalization to other domains.
Continued work is needed to improve the realism and diversity of the user simulator.
The complexity of the Dec-POMDP formulation may increase computational cost.
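As a rough illustration of how the no-user vs dual-control ablation separates error types: failures that persist even when the agent holds all the tools point to reasoning errors, while the additional drop under dual-control points to communication/coordination errors. The helper and the success rates below are made-up placeholders, not results from the paper.

```python
# Hypothetical helper, not part of tau^2-bench: one simple way to read the
# no-user vs dual-control gap. Success rates are made-up placeholders.
def attribute_errors(success_no_user: float, success_dual: float) -> dict:
    """Split total failure into a reasoning part and a coordination part."""
    reasoning = 1.0 - success_no_user              # fails even with full control
    coordination = success_no_user - success_dual  # extra loss from guiding the user
    return {"reasoning": round(reasoning, 2), "coordination": round(coordination, 2)}

print(attribute_errors(success_no_user=0.80, success_dual=0.55))
# -> {'reasoning': 0.2, 'coordination': 0.25}
```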