This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
This paper examines how large language model-based agents modify their environment through tools, emphasizing that, unlike traditional NLP tasks, agent evaluation must account for complex factors such as tool-to-tool relationships, environmental feedback, and previous decisions. To overcome the limitations of existing studies that fail to consider these interacting factors, the authors present $C^3$-Bench, an open-source benchmark that integrates the concept of attacks and applies univariate analysis to identify the key factors affecting agent robustness. The benchmark comprises three tasks (navigating complex tool relationships, processing critical hidden information, and managing dynamic decision paths), detailed metrics, a novel data collection algorithm, and a reproducible evaluation method. Experiments on 49 prominent agents reveal significant shortcomings in tool dependency, reliance on long-context information, and frequent policy-type switching. $C^3$-Bench aims to expose model vulnerabilities through these tasks and to facilitate research on the interpretability of agent performance. The benchmark is publicly available at https://github.com/yupeijei1997/C3-Bench .
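To illustrate the univariate analysis the paper applies, here is a minimal sketch: hold all experimental factors fixed except one (for example, the depth of tool dependencies) and compare agent success rates across that factor's levels. The record format, factor name, and function below are invented for demonstration and are not the actual $C^3$-Bench API.

```python
from collections import defaultdict

def univariate_success(records, factor):
    """Group trajectory records by one varied factor and compute the
    task success rate at each level of that factor."""
    totals = defaultdict(lambda: [0, 0])  # level -> [successes, trials]
    for r in records:
        level = r[factor]
        totals[level][1] += 1
        totals[level][0] += int(r["success"])
    return {level: s / n for level, (s, n) in totals.items()}

# Hypothetical trajectories: success tends to drop as tool-chain depth grows.
records = [
    {"tool_depth": 1, "success": True},
    {"tool_depth": 1, "success": True},
    {"tool_depth": 3, "success": True},
    {"tool_depth": 3, "success": False},
]
print(univariate_success(records, "tool_depth"))  # {1: 1.0, 3: 0.5}
```

Isolating one factor at a time in this way is what lets the benchmark attribute robustness failures to specific causes, such as tool dependency or long-context reliance, rather than to their combined effect.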