Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ASE: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Created by
  • Haebom

Authors

Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Miaoqian Lin, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang

Outline

This paper highlights the growing need for rigorous security evaluation of generated code as large language models (LLMs) are increasingly used in software engineering. Existing benchmarks have little relevance to real-world AI-assisted programming scenarios, making them inadequate for assessing the practical security risks of AI-generated code in operational environments. To address this gap, the paper presents AI Code Generation Security Evaluation (ASE), a repository-level benchmark designed to accurately reflect real-world AI programming tasks and to provide a comprehensive, reliable framework for assessing the security of AI-generated code. Evaluation results on leading LLMs show that current models still struggle with secure coding: the complexity of repository-level scenarios challenges LLMs that otherwise perform well on code-fragment-level tasks, and larger inference budgets do not necessarily lead to better code generation. These observations offer insight into the current state of AI code generation, help developers choose the most appropriate models for real-world tasks, and lay a foundation for improving LLMs so they generate secure and efficient code in practice.
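The summary above does not describe ASE's implementation, but the minimal Python sketch below illustrates the kind of repository-level evaluation loop such a benchmark implies: the model's completion is written into a real repository and the whole repository is scanned, rather than checking an isolated snippet. All names (`SecurityFinding`, `scan_repository`, `evaluate_generated_code`) and the regex-based rules are illustrative assumptions, not the ASE benchmark's actual API; a real harness would use proper static analyzers and vulnerability-specific test oracles.

```python
"""Hypothetical sketch of a repository-level security check for generated code."""
from dataclasses import dataclass
from pathlib import Path
import re


@dataclass
class SecurityFinding:
    file: str
    line: int
    rule: str      # e.g. a CWE identifier
    snippet: str


# Placeholder heuristics standing in for a real static-analysis suite.
RULES = {
    "CWE-798 hard-coded credential": re.compile(
        r"(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
    "CWE-89 string-built SQL": re.compile(r"execute\(\s*f['\"]"),
}


def scan_repository(repo_root: Path) -> list[SecurityFinding]:
    """Scan every Python file in the repository, not just the generated fragment."""
    findings: list[SecurityFinding] = []
    for path in repo_root.rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for rule, pattern in RULES.items():
                if pattern.search(line):
                    findings.append(SecurityFinding(str(path), lineno, rule, line.strip()))
    return findings


def evaluate_generated_code(repo_root: Path, target_file: Path, completion: str) -> dict:
    """Insert the model's completion into its target file, then score the whole repo."""
    target_file.write_text(completion)
    findings = scan_repository(repo_root)
    return {"secure": not findings, "findings": findings}
```

The point of the sketch is the evaluation granularity: because the generated code is judged in the context of the surrounding repository, vulnerabilities that only appear through cross-file interactions can be surfaced, which snippet-level benchmarks cannot capture.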

Takeaways, Limitations

Takeaways: We present ASE, a new benchmark for security evaluation in real-world AI-assisted programming scenarios. We reveal the limitations of current LLMs' secure code generation capabilities. We analyze how the complexity of repository-level tasks affects LLM performance. We show that larger inference budgets do not necessarily correlate with better code generation quality. We suggest directions for improving LLMs for real-world applications.
Limitations: Further research is needed on the generalizability of the ASE benchmark. Deeper analysis of how LLMs handle a broader range of security vulnerability types is needed, as is evaluation of a wider range of LLMs.