PentestJudge is a system for evaluating the operations of penetration testing agents. It uses a large language model (LLM) as a judge, consuming the agent's state and tool-call history to determine whether the agent's actions meet operational criteria that are difficult to evaluate programmatically. The evaluation criteria are organized as a hierarchical tree that decomposes the penetration testing task into smaller, simpler sub-tasks, with each leaf node representing a simple yes/no criterion for PentestJudge to evaluate. Task nodes are grouped into categories such as operational objectives, operational security, and technology. The LLM judge's scores are compared against those of human experts using binary classification metrics such as the F1 score. Evaluating several LLM models as judges, the best model achieved an F1 score of 0.83, and models with stronger tool-use capabilities scored more closely to the human experts. Stratifying F1 scores by requirement type revealed that models with similar overall scores struggled with different types of questions. Furthermore, the study showed that cheaper, less capable models can judge the penetration tests performed by stronger, more expensive models, suggesting that verification is easier than generation for penetration testing tasks. By sharing this methodology, we hope to facilitate future research into holistically and scalably assessing the process quality of AI-based information security agents, so that they can be used with confidence in sensitive operational environments.
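
To make the scoring setup concrete, the sketch below builds a toy rubric tree whose leaves are yes/no criteria and compares hypothetical judge verdicts against human expert labels with an F1 score. This is a minimal illustration only: the node names, criteria, category labels, and verdicts are invented for the example and are not taken from the paper.

```python
# Minimal sketch: a hierarchical rubric tree with yes/no leaf criteria,
# and an F1 comparison of judge verdicts against human expert labels.
# All node names, criteria, and labels below are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """A node in the rubric tree.

    Internal nodes group sub-tasks; leaf nodes hold a single yes/no criterion.
    """
    name: str
    category: str                      # e.g. "operational objectives", "operational security"
    criterion: Optional[str] = None    # set only on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)

    def leaves(self) -> List["RubricNode"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def f1_score(judge: List[bool], human: List[bool]) -> float:
    """F1 of the judge's yes/no verdicts against human expert labels."""
    tp = sum(j and h for j, h in zip(judge, human))
    fp = sum(j and not h for j, h in zip(judge, human))
    fn = sum(h and not j for j, h in zip(judge, human))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rubric: the pentest task decomposed into simpler sub-tasks.
rubric = RubricNode("pentest", "root", children=[
    RubricNode("initial access", "operational objectives", children=[
        RubricNode("valid credentials obtained", "operational objectives",
                   criterion="Did the agent obtain valid credentials for the target?"),
    ]),
    RubricNode("stealth", "operational security", children=[
        RubricNode("no destructive commands", "operational security",
                   criterion="Did the agent avoid destructive commands on the host?"),
    ]),
])

leaves = rubric.leaves()
judge_verdicts = [True, False]   # would come from the LLM judge reading the trajectory
human_verdicts = [True, True]    # ground-truth labels from human experts
print(f"{len(leaves)} leaf criteria, F1 = {f1_score(judge_verdicts, human_verdicts):.2f}")
```

In this framing, each leaf verdict is a binary classification decision, so standard metrics such as precision, recall, and F1 can be computed per requirement type to see where a given judge model agrees or disagrees with human experts.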