Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

PentestJudge: Judging Agent Behavior Against Operational Requirements

Created by
  • Haebom

Authors

Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce

Outline

PentestJudge is a system for evaluating the behavior of penetration-testing agents. It uses a large language model (LLM) as a judge that analyzes an agent's state and tool-call history to determine whether the agent meets operational requirements that are difficult to evaluate programmatically. The evaluation criteria are organized as a hierarchical tree that decomposes a penetration-testing task into smaller, simpler subtasks, with each leaf node representing a single yes/no criterion that PentestJudge evaluates. Task nodes are grouped into categories such as operational objectives, operational security, and technology. The judge's scores are compared against those of human experts using binary classification metrics such as the F1 score. Evaluating several LLM judges showed that the best model reached an F1 score of 0.83, and that models with stronger tool-use skills agreed more closely with human experts. Stratifying F1 scores by requirement type revealed that models with similar overall scores struggle with different types of questions. The study also showed that lower-cost models can judge the penetration-testing process of higher-performing agent models, suggesting that verification is easier than generation for this task. By sharing this methodology, the authors hope to enable future research on comprehensively and scalably assessing the process quality of AI-based information-security agents, so that they can be used safely in sensitive operational environments.
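To make the structure concrete, here is a minimal sketch in Python of how a hierarchical criteria tree with yes/no leaf nodes might be scored by an LLM judge over an agent's tool-call history. The `RubricNode` class, the `judge_leaf` prompt, and the all-children-must-pass aggregation are illustrative assumptions, not the paper's actual implementation; `llm` stands for any text-in, text-out model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class RubricNode:
    """A node in the hierarchical evaluation tree.

    Leaf nodes carry a single yes/no criterion; internal nodes aggregate
    their children (here, for illustration: pass only if all children pass).
    """
    name: str
    category: str = "operational objective"   # e.g. "operational security", "technology"
    criterion: Optional[str] = None           # set only on leaf nodes
    children: list = field(default_factory=list)

def judge_leaf(criterion: str, tool_calls: list, llm: Callable[[str], str]) -> bool:
    """Ask the LLM judge whether the agent's tool-call history satisfies one leaf criterion."""
    transcript = "\n".join(
        f"{call['tool']}({call['args']}) -> {call['result']}" for call in tool_calls
    )
    prompt = (
        "You are judging a penetration-testing agent against one requirement.\n"
        f"Requirement: {criterion}\n"
        f"Tool-call history:\n{transcript}\n"
        "Answer strictly 'yes' or 'no': did the agent satisfy the requirement?"
    )
    return llm(prompt).strip().lower().startswith("yes")

def score_tree(node: RubricNode, tool_calls: list, llm: Callable[[str], str]) -> bool:
    """Score the tree recursively: leaves go to the LLM judge,
    internal nodes pass only if every child passes (illustrative aggregation)."""
    if node.criterion is not None:
        return judge_leaf(node.criterion, tool_calls, llm)
    return all(score_tree(child, tool_calls, llm) for child in node.children)
```

The tree structure is what keeps each individual judgment simple: the LLM only ever answers one narrow yes/no question per leaf, and the aggregation up the tree is deterministic.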

Takeaways, Limitations

Takeaways:
A novel method is presented for effectively evaluating the behavior of penetration-testing agents with an LLM judge.
A hierarchical tree of evaluation criteria breaks complex tasks into simpler ones and improves evaluation accuracy.
Inexpensive models can judge the work of higher-performing agent models, showing that verification is easier than generation.
Contributes to making AI-based information-security agents more trustworthy and to establishing a safe operating environment.
Tool-use skill was confirmed to affect judging accuracy.
Limitations:
The best model's F1 score is 0.83, which is not a perfect match with human expert judgments (see the F1 sketch after this list).
Judging accuracy varies by requirement type; models with similar overall scores can have different strengths and weaknesses.
The evaluation criteria may carry subjectivity, since they depend on the judgments of human experts.
Further research is needed on generalizability to diverse environments and scenarios.
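As a rough illustration of how judge quality is quantified, the sketch below treats the judge as a binary classifier of the human experts' per-criterion verdicts and computes an F1 score. The labels are made-up example data, not results from the paper.

```python
from sklearn.metrics import f1_score

# Hypothetical per-criterion verdicts: 1 = requirement satisfied, 0 = not satisfied.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # human expert ground truth
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # LLM judge predictions

# F1 is the harmonic mean of precision and recall over the "satisfied" class.
print(f"Judge-vs-human F1: {f1_score(human_labels, judge_labels):.2f}")
```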