As agent workflows are increasingly adopted across a variety of domains, the complex traces they generate demand scalable and systematic evaluation. Existing evaluation methods rely on manual, domain-specific human analysis of workflow traces, an approach that does not scale as the complexity and volume of agent outputs grow. In this paper, we (1) articulate the need for robust and dynamic evaluation of agent workflow traces, (2) present a formal taxonomy of error types encountered in agent systems, and (3) introduce TRAIL, a human-annotated dataset of 148 traces built on this taxonomy. TRAIL curates traces from both single- and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval to ensure ecological validity. Our evaluation shows that state-of-the-art long-context LLMs underperform at trace debugging, with Gemini-2.5-pro scoring only 11% on TRAIL. By making our dataset and code public, we support and accelerate future research on the scalable evaluation of agent workflows.