We introduce HonestCyberEval, a new benchmark designed to assess the capabilities and risks of AI models in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using an Nginx web server repository augmented with synthetically injected vulnerabilities, we evaluated several leading language models, including OpenAI's GPT-4.5, o3-mini, o1, and o1-mini; Anthropic's Claude-3-7-sonnet-20250219, Claude-3.5-sonnet-20241022, and Claude-3.5-sonnet-20240620; Google DeepMind's Gemini-1.5-pro; and OpenAI's earlier GPT-4o. The results show significant differences in success rate and effectiveness across models: o1-preview achieved the highest success rate (92.85%), while o3-mini and Claude-3.7-sonnet-20250219 offered cost-effective but lower-success-rate alternatives. This risk assessment provides a foundation for systematically evaluating AI cyber risk in realistic cyberattack operations.
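
To make the evaluation setup more concrete, the following is a minimal, hypothetical sketch of how a harness like this could score a model: for each synthetic vulnerability it asks the model for a candidate exploit, replays that candidate against an instrumented Nginx build, and reports the fraction of challenges exploited. The names used here (`query_model`, `exploit_triggers_crash`, the challenge IDs) and the simulated crash check are illustrative assumptions, not the benchmark's actual code.

```python
# Hypothetical sketch of an exploitation-benchmark scoring loop.
# All names and the simulated outcome are placeholders, not HonestCyberEval's code.

import random

# Placeholder IDs standing in for synthetic vulnerabilities injected into Nginx.
CHALLENGES = ["cpv_01", "cpv_02", "cpv_03", "cpv_04"]


def query_model(model_name: str, challenge_id: str) -> str:
    """Stub: a real harness would prompt the model with the vulnerable code or diff
    and return its candidate exploit (e.g. a crafted HTTP request)."""
    return f"GET /{challenge_id} HTTP/1.1\r\nHost: target\r\n\r\n"


def exploit_triggers_crash(challenge_id: str, payload: str) -> bool:
    """Stub: a real harness would replay the payload against an instrumented
    (e.g. sanitizer-enabled) Nginx build and check whether it crashes.
    Here the outcome is simulated so the sketch runs end to end."""
    return random.random() < 0.5


def success_rate(model_name: str) -> float:
    """Fraction of challenges for which the model produced a working exploit."""
    hits = sum(
        exploit_triggers_crash(c, query_model(model_name, c)) for c in CHALLENGES
    )
    return hits / len(CHALLENGES)


if __name__ == "__main__":
    for model in ["model-a", "model-b"]:
        print(f"{model}: {success_rate(model):.0%} of challenges exploited")
```

In this framing, a model's reported success rate is simply the share of injected vulnerabilities for which its generated input demonstrably triggers the bug, which is the metric the results above compare across models.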