Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Created by
  • Haebom

Authors

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, Daniel Kang

Outline

This paper argues that, as large language model (LLM) agents become increasingly capable of conducting cyberattacks autonomously, a realistic benchmark is needed to evaluate their ability to exploit web application vulnerabilities. Existing benchmarks are limited either by the abstractions of Capture the Flag competitions or by a lack of comprehensive coverage. To address this, the paper presents CVE-Bench, a real-world cybersecurity benchmark based on high-severity Common Vulnerabilities and Exposures (CVEs). CVE-Bench includes a sandbox framework that lets LLM agents attack vulnerable web applications in scenarios that mimic real-world conditions, and that evaluates whether each exploit succeeded. The evaluation shows that the state-of-the-art agent framework can successfully exploit up to 13% of the vulnerabilities.
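As a rough illustration of how a headline figure such as "up to 13%" could be derived, the sketch below computes a success rate from per-task outcomes. The function name, the task identifiers, and the example data are all hypothetical placeholders; they are not CVE-Bench's actual API, task list, or results.

```python
from typing import Mapping


def success_rate(outcomes: Mapping[str, bool]) -> float:
    """Return the fraction of tasks marked as successfully exploited."""
    if not outcomes:
        # No tasks evaluated: report 0.0 rather than dividing by zero.
        return 0.0
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(outcomes.values()) / len(outcomes)


# Hypothetical placeholder data, NOT results from the paper.
example_outcomes = {
    "task-01": True,
    "task-02": False,
    "task-03": False,
    "task-04": False,
}

print(f"Success rate: {success_rate(example_outcomes):.0%}")
```

A real harness would also need to decide what counts as success for each task (for example, which attack outcome the sandbox detected), which this sketch deliberately leaves out.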

Takeaways, Limitations

Takeaways: CVE-Bench, a benchmark that mimics real-world cyberattacks, provides a foundation for realistically evaluating how well LLM agents can exploit web application vulnerabilities. The paper measures the performance of state-of-the-art agents and suggests directions for future research and development.
Limitations: The LLM agents evaluated so far successfully exploit only a small share of the vulnerabilities, at most 13%. The benchmark could be made more comprehensive by including more diverse and complex CVEs. Building a sandbox environment that exactly matches a real-world environment is difficult.