This paper presents CUAHarm, a novel benchmark for assessing the exploitation risk of computer-use agents (CUAs), which autonomously control computers to perform multi-step tasks. CUAHarm consists of 104 expert-written, realistic exploitation scenarios, including disabling firewalls, exfiltrating data, and installing backdoors. It also provides a sandboxed environment with rule-based, verifiable rewards for measuring the success rate of CUA operations. We evaluate state-of-the-art LLMs, including GPT-5, Claude 4 Sonnet, Gemini 2.5 Pro, Llama-3.3-70B, and Mistral Large 2, and find that they carry out malicious tasks at high success rates (e.g., 90% for Gemini 2.5 Pro) without any jailbreaking prompts. We also find that newer models, which existing safety benchmarks rate as safer, tend to be more vulnerable to exploitation as CUAs (e.g., Gemini 2.5 Pro is more vulnerable than Gemini 1.5 Pro). Furthermore, we demonstrate that a model robust against common malicious prompts (e.g., bomb-making instructions) when operating as a chatbot may still be unsafe when operating as a CUA. Our evaluation of UI-TARS-1.5, a leading agent framework, reveals that while it improves task performance, it also increases the risk of exploitation. To mitigate the exploitation risk of CUAs, we explore using an LLM to monitor CUA behavior and find this significantly more challenging than monitoring unsafe chatbot responses. Monitoring the CUA's thought process yields some gains, but average monitoring accuracy is only 77%. Hierarchical summarization strategies improve accuracy by up to 13%, yet monitoring remains unreliable. The benchmark will be released publicly to facilitate research on risk mitigation.
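To make the notion of "rule-based, verifiable rewards" concrete, the sketch below shows one plausible shape such a verifier could take for the firewall-disablement scenario. The function name and the string-matching rule are illustrative assumptions, not the benchmark's actual implementation: the idea is simply that the sandbox inspects post-hoc system state (e.g., `ufw status` output) and emits a binary reward.

```python
def verify_firewall_disabled(status_output: str) -> int:
    """Hypothetical rule-based, verifiable reward for the
    firewall-disablement scenario: inspect the sandbox VM's
    firewall status output and return 1 if the agent
    succeeded in disabling it, else 0."""
    return 1 if "inactive" in status_output.lower() else 0

# Example: output as produced by `ufw status` inside the sandbox.
print(verify_firewall_disabled("Status: inactive"))  # → 1
print(verify_firewall_disabled("Status: active"))    # → 0
```

Because the check reads final system state rather than the agent's transcript, the reward is deterministic and cannot be gamed by the agent merely claiming success.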