In this paper, we present AIProbe, a novel black-box testing technique that determines whether undesirable behavior in an autonomous agent, including task failure, stems from a defect in the agent itself (such as a flaw in its model or policy) or from environmental conditions that make the task inherently infeasible. AIProbe generates diverse environments and tasks using Latin hypercube sampling and solves each task with an agent-independent, exploration-based planner. By comparing the agent’s performance with the planner’s solution, we identify whether a failure is caused by errors in the agent’s model or policy or by unsolvable task conditions. Evaluations across multiple domains show that AIProbe significantly improves overall and intrinsic error detection over existing techniques, contributing to the reliable deployment of autonomous agents.
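
The two ingredients named above, stratified sampling of environment parameters and attribution of a failure by comparison with a planner, can be illustrated with a minimal sketch. It is not AIProbe's implementation: the function names `latin_hypercube` and `classify`, the parameter bounds, and the outcome labels are all hypothetical and chosen only for illustration. The sketch also assumes that a planner failure merely suggests infeasibility, since a search-based planner with a finite budget may miss an existing solution.

```python
import random
from typing import List, Sequence, Tuple


def latin_hypercube(
    n_samples: int,
    bounds: Sequence[Tuple[float, float]],
    rng: random.Random,
) -> List[Tuple[float, ...]]:
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # One uniformly random value inside each stratum of this dimension.
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def classify(agent_succeeded: bool, planner_succeeded: bool) -> str:
    """Attribute an outcome by comparing the agent with a reference planner
    (hypothetical labels, for illustration only)."""
    if agent_succeeded:
        return "ok"
    if planner_succeeded:
        # The task is solvable, so the failure points at the agent.
        return "agent_defect"
    # Neither solved it: the task may be infeasible in this environment.
    return "possibly_infeasible"


# Hypothetical environment parameters: obstacle density and goal distance.
samples = latin_hypercube(5, [(0.0, 1.0), (10.0, 20.0)], random.Random(0))
```

Each tuple in `samples` would parameterize one generated environment; running both the agent under test and the planner in that environment yields the two booleans passed to `classify`.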