Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HonestCyberEval: An AI Cyber Risk Benchmark for Automated Software Exploitation

Created by
  • Haebom

Authors

Dan Ristea, Vasilios Mavroudis

Outline

This paper introduces HonestCyberEval, a new benchmark designed to assess the capabilities and risks of AI models in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using an Nginx web server repository seeded with synthetic vulnerabilities, the authors evaluate several leading language models: OpenAI's GPT-4.5, o3-mini, o1, and o1-mini; Anthropic's Claude-3-7-sonnet-20250219, Claude-3.5-sonnet-20241022, and Claude-3.5-sonnet-20240620; Google DeepMind's Gemini-1.5-pro; and OpenAI's earlier GPT-4o. The results show significant differences in success rate and efficiency across models: o1-preview achieved the highest success rate (92.85%), while o3-mini and Claude-3.7-sonnet-20250219 offered cheaper but less successful alternatives. The benchmark provides a foundation for systematically assessing AI cyber risk in realistic offensive cyber operations.
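The summary above describes the evaluation protocol only at a high level. As a rough illustration, here is a minimal sketch in Python of how a harness of this shape could aggregate per-model success rates and costs; `Task`, `query_model`, and the exploit oracle are hypothetical stand-ins for illustration, not the paper's actual code.

```python
# Minimal sketch of a HonestCyberEval-style evaluation loop.
# All names here (Task, query_model, the oracle) are hypothetical
# stand-ins for illustration, not the paper's actual harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    vulnerable_code: str           # e.g. an Nginx source file with a seeded bug
    oracle: Callable[[str], bool]  # returns True if the generated exploit works

def query_model(model: str, prompt: str) -> tuple[str, float]:
    """Hypothetical API wrapper: returns (model output, query cost in USD)."""
    raise NotImplementedError

def evaluate(model: str, tasks: list[Task]) -> dict:
    successes, total_cost = 0, 0.0
    for task in tasks:
        prompt = ("Identify the vulnerability in the following code and "
                  "produce an input that triggers it:\n" + task.vulnerable_code)
        exploit, cost = query_model(model, prompt)
        total_cost += cost
        if task.oracle(exploit):   # run the exploit against the target
            successes += 1
    return {
        "model": model,
        "success_rate": successes / len(tasks),
        "cost_per_success": total_cost / max(successes, 1),
    }
```

Tracking cost alongside success rate is what makes the trade-off noted above (o1-preview's top success rate versus cheaper, weaker models) measurable.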

Takeaways, Limitations

Takeaways:
  • Presents HonestCyberEval, a new benchmark for evaluating the ability of AI models to exploit vulnerabilities in real-world software systems.
  • Compares and analyzes the automated software exploitation capabilities of several state-of-the-art language models.
  • Offers guidance on model selection by relating success rate to cost-effectiveness (a small ranking sketch follows the Limitations list below).
  • Establishes a systematic framework for AI cyber risk assessment in realistic cyberattack scenarios.
Limitations:
  • The assessment covers only the Nginx web server and synthetic vulnerabilities, so generalization to other software systems and real-world vulnerabilities is limited.
  • The set of language models and versions evaluated is limited; broader coverage of models is needed.
  • How realistic the synthetic vulnerabilities are, and how they differ from vulnerabilities found in the wild, needs further consideration.
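As a follow-up to the cost-effectiveness takeaway above, here is a small, hypothetical sketch of how models could be ranked once per-model success rates and costs are measured. The numbers are illustrative placeholders (only the 92.85% figure comes from the summary); the Pareto filter simply keeps models that are not beaten on both axes at once.

```python
# Hypothetical ranking of models by success rate vs. cost.
# These numbers are illustrative placeholders, NOT the paper's results,
# except o1-preview's 92.85% success rate quoted in the summary above.
results = [
    {"model": "o1-preview",                 "success_rate": 0.9285, "cost_per_success": 4.00},
    {"model": "o3-mini",                    "success_rate": 0.70,   "cost_per_success": 0.50},
    {"model": "claude-3-7-sonnet-20250219", "success_rate": 0.72,   "cost_per_success": 0.80},
]

def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is at least as good on both axes and strictly better on one."""
    return (a["success_rate"] >= b["success_rate"]
            and a["cost_per_success"] <= b["cost_per_success"]
            and (a["success_rate"] > b["success_rate"]
                 or a["cost_per_success"] < b["cost_per_success"]))

# Keep the Pareto-optimal models: those no other model dominates.
pareto = [r for r in results if not any(dominates(o, r) for o in results)]
for r in sorted(pareto, key=lambda r: -r["success_rate"]):
    print(f"{r['model']}: {r['success_rate']:.1%} at ${r['cost_per_success']:.2f}/success")
```

With inputs like these, the cheapest model, the most successful model, and a middle option can all sit on the frontier, which matches the paper's framing of cheaper models as lower-success alternatives rather than strictly worse choices.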