Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

An Empirical Study of Vulnerabilities in Python Packages and Their Detection

Created by
  • Haebom

Authors

Haowei Quan, Junjie Wang, Xinzhe Li, Terry Yue Zhuo, Xiao Chen, Xiaoning Du

Outline

To address the lack of research on the effectiveness of Python package vulnerability detection tools, this paper introduces PyVul, the first comprehensive benchmark of Python package vulnerabilities. PyVul contains 1,157 publicly reported, developer-verified vulnerabilities, each linked to the affected package. To accommodate a variety of detection techniques, it provides annotations at both the commit and function levels, and an LLM-based data cleansing method brings the labels to 100% commit-level accuracy and 94% function-level accuracy. A distribution analysis of PyVul shows that Python package vulnerabilities span a wide range of programming languages and vulnerability types, suggesting that multilingual Python packages may be more susceptible to vulnerabilities. Benchmarking existing tools against PyVul uncovers a significant gap between their performance and what is required to identify security issues in real-world Python packages. Through an empirical review of the top CWEs, the authors assess the limitations of current detection tools and highlight directions for future improvement.
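To make the dual-granularity annotations concrete, here is a minimal sketch of how a PyVul-style record might be represented and consumed by a function-level detector. The paper does not publish a schema, so every field and function name below is a hypothetical assumption for illustration, not PyVul's actual format.

```python
from dataclasses import dataclass

# Hypothetical record layout -- the paper does not specify PyVul's on-disk
# format, so these field names are illustrative assumptions.
@dataclass
class VulnRecord:
    cve_id: str        # public advisory identifier
    package: str       # affected PyPI package
    fix_commits: list  # commit hashes that patch the vulnerability
    functions: list    # fully qualified names of vulnerable functions
    cwe: str           # weakness category, e.g. "CWE-502"
    languages: list    # languages touched by the fix (Python, C, ...)

def function_level_samples(records):
    """Yield (package, function, cwe) triples for detectors that
    operate at function granularity."""
    for r in records:
        for fn in r.functions:
            yield (r.package, fn, r.cwe)

# Example usage with a made-up entry (placeholder data, not from PyVul):
records = [
    VulnRecord(
        cve_id="CVE-2023-0000",
        package="example-pkg",
        fix_commits=["a1b2c3d"],
        functions=["example_pkg.parser.load"],
        cwe="CWE-502",
        languages=["Python", "C"],
    ),
]

# The multilingual-risk analysis would slice records by language count:
multilingual = [r for r in records if len(r.languages) > 1]
print(list(function_level_samples(multilingual)))
```

A commit-level detector would instead consume `fix_commits`, which is why carrying both annotation levels in one record accommodates different classes of tools.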

Takeaways, Limitations

Takeaways:
  • PyVul is the first large-scale, accurate Python package vulnerability benchmark.
  • It identifies the different types of Python package vulnerabilities and their correlation with multilingual codebases.
  • It demonstrates the performance limitations of existing vulnerability detection tools and the need for improvement.
  • Multilingual Python packages appear to carry an increased vulnerability risk.
Limitations:
  • PyVul's data covers only publicly reported, developer-verified vulnerabilities; undiscovered vulnerabilities are not reflected.
  • The LLM-based data cleansing method leaves function-level accuracy at 94% rather than 100%, so some function-level labels may still be incorrect.
  • The types and limitations of the state-of-the-art detectors used in the analysis are not described in detail.
  • Concrete proposals for future development directions are lacking.