Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TRAIL: Trace Reasoning and Agentic Issue Localization

Created by
  • Haebom

Authors

Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian

Overview

This paper argues that, as agent workflows are adopted across more domains, the complex traces they generate need scalable and systematic evaluation. Existing evaluation methods rely on manual, domain-specific human analysis of workflow traces, an approach that scales poorly as the complexity and volume of agent outputs grow. The paper (1) articulates the need for a robust and dynamic evaluation method for agent workflow traces, (2) presents a formal taxonomy of error types encountered in agent systems, and (3) introduces TRAIL, a large-scale human-annotated dataset of 148 traces built on this taxonomy. To ensure ecological validity, TRAIL curates traces from both single- and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. The evaluation shows that state-of-the-art long-context LLMs perform poorly at trace debugging, with the Gemini-2.5-pro model scoring 11% on TRAIL. The dataset and code are publicly released to support and accelerate future research on scalable evaluation of agent workflows.

Takeaways, Limitations

Takeaways: Introduces a new standard dataset (TRAIL) for evaluating agent workflows. Provides a formal taxonomy of agent system error types. Demonstrates empirically that LLMs perform poorly at debugging agent workflow traces. Releases the dataset and code publicly for future research.
Limitations: The TRAIL dataset may be limited in size. The taxonomy may not cover every type of agent system error. The evaluation results may hold only for the specific LLMs tested. The dataset may not fully reflect the diversity of real-world applications.