Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

Created by
  • Haebom

Authors

Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang

Outline

This paper examines how large language model (LLM) based agents modify their environments using tools, and emphasizes that, unlike traditional NLP tasks, agent evaluation must account for complex factors such as tool-to-tool relationships, environmental feedback, and previous decisions. To overcome the limitations of existing studies, which fail to consider the influence of these factors, the authors present $C^3$-Bench, an open-source benchmark that integrates the concept of attacks and applies univariate analysis to identify the key factors affecting agent robustness. The benchmark comprises three tasks (navigating complex tool relationships, processing critical hidden information, and managing dynamic decision paths), along with fine-grained metrics, a novel data-collection algorithm, and a reproducible evaluation method. Experiments on 49 mainstream agents reveal significant weaknesses in handling tool dependencies, long-context information, and frequent policy-type switching. $C^3$-Bench aims to expose model vulnerabilities through these tasks and to advance research on the interpretability of agent performance. The benchmark is publicly available at https://github.com/yupeijei1997/C3-Bench .
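The univariate analysis mentioned above can be pictured as a one-factor-at-a-time sweep: perturb a single factor (e.g., tool-chain depth or the number of policy switches) while holding all others at baseline, then compare the agent's success rate across levels. The sketch below is purely illustrative; the factor names, levels, and the toy scoring function are assumptions, not taken from the paper or its benchmark code.

```python
# Illustrative sketch of a univariate (one-factor-at-a-time) robustness sweep.
# Factor names and the toy scoring model are hypothetical, not from C^3-Bench.

BASELINE = {"tool_chain_depth": 1, "context_distance": 0, "policy_switches": 0}

def run_agent(config):
    """Stand-in for evaluating an agent under one configuration.

    Returns a success rate in [0, 1]; here a toy linear model where each
    perturbation lowers performance. A real harness would run the agent
    on benchmark tasks and measure task success instead.
    """
    penalty = (0.05 * config["tool_chain_depth"]
               + 0.02 * config["context_distance"]
               + 0.10 * config["policy_switches"])
    return max(0.0, 1.0 - penalty)

def univariate_sweep(factor, levels):
    """Vary `factor` over `levels`, holding all other factors at baseline."""
    results = {}
    for level in levels:
        config = dict(BASELINE, **{factor: level})
        results[level] = run_agent(config)
    return results

if __name__ == "__main__":
    for factor, levels in [("tool_chain_depth", [1, 3, 5]),
                           ("policy_switches", [0, 2, 4])]:
        print(factor, univariate_sweep(factor, levels))
```

Because only one factor moves at a time, any performance drop can be attributed to that factor alone; the trade-off, noted under Limitations below, is that interaction effects between factors stay invisible.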

Takeaways, Limitations

Takeaways:
  • Presents $C^3$-Bench, a new benchmark for evaluating the robustness of LLM-based agents.
  • Reveals agent vulnerabilities related to tool dependencies, long-context information, and frequent policy-type switching.
  • Suggests new directions for research on the interpretability of agent performance.
  • Released as open source, contributing to the research community.
Limitations:
  • The univariate analysis does not account for multivariate interaction effects between factors.
  • The presented tasks may not fully capture the complexity of real-world environments.
  • The subjectivity and generalizability of the evaluation metrics require further review.