Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ASE: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Created by
  • Haebom

Authors

Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang

Outline

This paper argues that the growing use of large language models (LLMs) in software engineering makes security evaluation of generated code imperative. Existing benchmarks bear little resemblance to real-world AI programming scenarios, so they are inadequate for assessing the practical security risks of AI-generated code. To address this gap, the paper presents AI Code Generation Security Evaluation (ASE), a repository-level benchmark designed to accurately reflect real-world AI programming tasks. Evaluations of leading LLMs on ASE show that current models struggle with secure coding, and that the complexity of repository-level scenarios challenges even models that perform well on snippet-level tasks. The authors also find that larger inference budgets do not necessarily lead to better code generation. These observations offer insight into the current state of AI code generation, help developers choose models appropriate for their tasks, and lay a foundation for improving LLMs so they generate secure and efficient code in real-world applications.
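
To make the idea of a repository-level security evaluation concrete, here is a minimal, hypothetical sketch of such an evaluation loop. It is not the ASE benchmark's actual interface: RepoTask, build_context, evaluate_security, and the generate/scan callables are illustrative stand-ins for repository context assembly, an LLM call, and a security analyzer.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical shape of a repository-level task: the model sees surrounding
# repository files as context, not just an isolated code snippet.
@dataclass
class RepoTask:
    repo_root: Path    # checkout of the target repository
    target_file: str   # file the model must write or modify
    instruction: str   # natural-language description of the change

def build_context(task: RepoTask, max_files: int = 5) -> str:
    """Concatenate a few repository files into a prompt context (simplified)."""
    parts = []
    for path in sorted(task.repo_root.rglob("*.py"))[:max_files]:
        parts.append(f"# File: {path.relative_to(task.repo_root)}\n{path.read_text()}")
    return "\n\n".join(parts)

def evaluate_security(task: RepoTask,
                      generate: Callable[[str], str],
                      scan: Callable[[Path], list[str]]) -> dict:
    """Generate code for the task, write it into the repo, and scan for issues.

    `generate` and `scan` are stand-ins for an LLM call and a security
    analyzer; neither is part of ASE's published interface.
    """
    prompt = f"{build_context(task)}\n\n# Task: {task.instruction}\n"
    completion = generate(prompt)
    (task.repo_root / task.target_file).write_text(completion)
    findings = scan(task.repo_root)  # e.g. CWE hits reported by a scanner
    return {"task": task.instruction, "secure": not findings, "findings": findings}

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    import tempfile
    demo_root = Path(tempfile.mkdtemp())
    (demo_root / "app.py").write_text("def handler(q):\n    return q\n")
    task = RepoTask(demo_root, "handler_new.py", "sanitize the query before use")
    fake_llm = lambda prompt: "def handler(q):\n    return q.replace(';', '')\n"
    fake_scanner = lambda root: []  # a real run would invoke a SAST tool here
    print(evaluate_security(task, fake_llm, fake_scanner))
```

The point of the sketch is the difference from snippet-level benchmarks: the prompt is built from the surrounding repository, and the output is judged by scanning the whole repository rather than the generated fragment alone.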

Takeaways, Limitations

Takeaways:
Introduces ASE, a new benchmark that reflects real-world AI programming scenarios.
Reveals the limitations of current LLMs in generating secure code.
Analyzes how the complexity of repository-level tasks affects LLM performance.
Finds that larger inference budgets do not necessarily improve code generation quality.
Offers guidance for developers selecting LLMs and directions for improving them.
Limitations:
Further research is needed to determine the generalizability of the ASE benchmark.
Evaluations should be extended to a wider range of LLMs and programming languages.
The benchmark should be improved to reflect more complex and diverse real-world scenarios.