Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Revisiting Pre-trained Language Models for Vulnerability Detection

Created by
  • Haebom

Authors

Youpeng Li, Weiliang Qi, Xuyu Wang, Fuxun Yu, Xinda Wang

Outline

This paper presents RevisitVD, a comprehensive study of pre-trained language models (PLMs) for vulnerability detection (VD). Using a newly constructed dataset, the authors compare fine-tuning and prompt-engineering approaches across 17 PLMs, including small code-specific PLMs and large language models, evaluating their effectiveness under various training and test settings, their generalization ability, and their robustness to code normalization, abstraction, and semantic-preserving transformations. They find that PLMs whose pre-training tasks are designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and PLMs merely pre-trained or fine-tuned on large code corpora. However, even these models struggle in realistic scenarios, such as detecting vulnerabilities with complex dependencies, handling changes introduced by code normalization and abstraction, and recognizing semantically preserved transformations of vulnerable code. The authors also show that the limited context window of PLMs can cause significant labeling errors due to truncation.
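To illustrate the kind of robustness probe the study describes, here is a minimal sketch of identifier abstraction: every user-defined name in a code snippet is replaced with a placeholder, so a model that memorized surface tokens rather than code semantics will see a very different input. This is a simplified illustration, not the paper's actual transformation pipeline; the keyword list and `VARn` naming scheme are assumptions.

```python
import re

# Small, assumed keyword list for illustration; a real tool would use a parser.
C_KEYWORDS = frozenset({"int", "char", "void", "if", "else",
                        "for", "while", "return", "sizeof"})

def abstract_identifiers(code: str) -> str:
    """Replace each user identifier with VAR1, VAR2, ... in order of
    first appearance, a crude stand-in for code abstraction."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", repl, code)

src = "int copy(char *dst, char *src) { while (*src) *dst++ = *src++; return 0; }"
print(abstract_identifiers(src))
# → int VAR1(char *VAR2, char *VAR3) { while (*VAR3) *VAR2++ = *VAR3++; return 0; }
```

A semantics-aware model should classify the original and the abstracted snippet the same way; the study finds that PLMs often do not.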

Takeaways, Limitations

Takeaways: Pre-training tasks that capture the syntactic and semantic patterns of code are crucial for improving VD performance, and PLM evaluation should reflect real-world VD conditions.
Limitations: Current PLMs remain hard to apply in real-world scenarios involving vulnerabilities with complex dependencies and semantic code transformations, and their limited context windows lead to truncation-induced labeling errors; further improvements are needed before practical deployment.