Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CodeMirage: Hallucinations in Code Generated by Large Language Models

Created by
  • Haebom

Authors

Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu

Outline

This paper presents the first study of the hallucinations that large language models (LLMs) produce during code generation. The authors define 'code hallucination' as defects introduced by LLMs in generated code, such as syntactic and logical errors, security vulnerabilities, and memory leaks, and provide a comprehensive taxonomy of hallucination types. They introduce CodeMirage, a benchmark dataset of 1,137 hallucinated code snippets generated by GPT-3.5 from Python programming problems. They evaluate code hallucination detection with models including CodeLLaMA, GPT-3.5, and GPT-4, and show that GPT-4 performs best on the HumanEval dataset and achieves results comparable to a fine-tuned CodeBERT baseline on the MBPP dataset. Finally, they discuss various strategies for mitigating code hallucination.
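The detection experiments described above are prompt-based. As a rough illustration only, the sketch below shows how an LLM could be asked to flag a hallucinated Python snippet; the prompt wording, model choice, and the helper function detect_hallucination are assumptions for this example, not the authors' exact setup.

```python
# Minimal sketch of prompt-based code-hallucination detection.
# Prompt text, model name, and example snippet are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DETECTION_PROMPT = """You are reviewing Python code generated for a programming problem.
Problem: {problem}
Code:
{code}
Does the code contain a hallucination (syntactic error, logical error,
security vulnerability, or memory/resource misuse)? Answer 'yes' or 'no'."""


def detect_hallucination(problem: str, code: str, model: str = "gpt-4") -> bool:
    """Return True if the model judges the snippet to be hallucinated."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": DETECTION_PROMPT.format(problem=problem, code=code),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")


# Example: a snippet with a logical error (it sums instead of multiplying).
problem = "Return the product of all numbers in a list."
code = (
    "def product(nums):\n"
    "    total = 0\n"
    "    for n in nums:\n"
    "        total += n\n"
    "    return total"
)
print(detect_hallucination(problem, code))
```

In the paper's framing, such a prompt-based detector is run over the benchmark snippets and compared against a fine-tuned classifier baseline such as CodeBERT.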

Takeaways, Limitations

Takeaways:
By systematically addressing the reliability issues of LLM-based code generation, the work is an important step toward safe and trustworthy LLM-based code generation systems.
As the first comprehensive study of code hallucination, it provides the CodeMirage dataset as a foundation for future research.
The comparative analysis of hallucination detection performance across LLMs suggests directions for future model development.
The discussion of code hallucination mitigation strategies offers practical guidance for real-world applications.
Limitations:
The benchmark is limited to Python; research on other programming languages is still needed.
The experiments rely on one-shot prompting; more sophisticated prompt engineering techniques remain to be explored.
Empirical evaluation of code hallucination mitigation strategies is lacking and calls for deeper follow-up research.
The analysis does not break down detection performance by hallucination type.