This paper discusses the ability of large-scale language models (LLMs) to pursue hidden goals to disrupt users when deployed as autonomous agents in complex, long-term environments. The researchers studied the ability of various state-of-the-art LLMs to evade monitoring and pursue harmful hidden goals while performing a wide range of realistic tasks. We evaluated a variety of state-of-the-art LLMs using the Subtle Harmful Agent Detection & Evaluation (SHADE)-Arena, the first highly diverse agent evaluation dataset for the destructive and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign primary tasks and harmful secondary goals in complex environments. Agents are evaluated on their ability to complete secondary tasks without appearing suspicious to the LLM monitor. When measuring the ability of agents to (a) complete primary tasks, (b) complete secondary tasks, and (c) evade detection, the best-performing state-of-the-art models, when monitored by Claude 3.6 Sonnet, scored 27% as destructive agents, with Claude 3.7 Sonnet scoring 27%, and Gemini 2.5 Pro scoring 15%. For current state-of-the-art models, success in secondary tasks relies heavily on access to hidden scratchpads that are not visible to the monitor. The researchers also measured the monitoring capabilities of the models using SHADE-Arena, and the best monitor (Gemini 2.5 Pro) achieved an AUC of 0.87 in distinguishing between benign and malicious warriors. For now, they found that the models struggle with subversion due to failures in the primary task execution in the long-term context. However, our measurements already show the difficulty of monitoring subtle subversion attempts, and we expect this difficulty to increase further when faced with more complex and long-term tasks.