Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Created by
  • Haebom

Authors

Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le

Outline

This paper proposes CodeSense, the first benchmark to provide a spectrum of fine-grained code reasoning tasks relevant to software engineering (SE). The benchmark collects Python, C, and Java projects from real-world repositories, together with their test execution traces, to build a ground-truth dataset for fine-grained semantic reasoning tasks. The authors then conduct a comprehensive evaluation of state-of-the-art LLMs and demonstrate a clear performance gap on these fine-grained reasoning tasks. Beyond the benchmark, dataset, and evaluation, the paper also provides an execution tracing framework and toolset that make it easy to collect ground truth for fine-grained SE reasoning tasks.
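The paper's tracing framework itself is not reproduced in this summary, but the core idea of deriving fine-grained ground truth from test execution traces can be sketched in a few lines of Python. The sketch below is an illustrative assumption, not the authors' toolset (which also covers C and Java): it uses the standard sys.settrace hook to record local variable values at every executed line of a target function; trace_locals and clamp are hypothetical names.

```python
import sys

def trace_locals(func, *args, **kwargs):
    """Run func and record (line number, local variables) at each executed
    line -- the kind of fine-grained ground truth an execution trace can
    provide for semantic reasoning tasks. Illustrative sketch only."""
    trace = []

    def tracer(frame, event, arg):
        # Record line events for the target function's frame only.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)  # always remove the hook
    return result, trace

def clamp(x, lo, hi):
    if x < lo:
        x = lo
    if x > hi:
        x = hi
    return x

result, trace = trace_locals(clamp, 15, 0, 10)
print(result)  # 10
for lineno, local_vars in trace:
    print(lineno, local_vars)  # shows which branches ran and how x changed
```

From such a trace one can derive ground truth for fine-grained questions like "which branch executes?" or "what value does x hold after this line?" without manual annotation.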

Takeaways, Limitations

Takeaways:
A new benchmark (CodeSense) is proposed for evaluating the code reasoning ability of LLMs on real-world software engineering tasks.
A dataset grounded in real-world code is built for fine-grained code reasoning tasks.
Evaluation of state-of-the-art LLMs exposes the models' limitations in fine-grained code reasoning.
The execution tracing framework and toolset can support future benchmark construction and model training.
Limitations:
Despite prompting techniques such as chain-of-thought and in-context learning, LLM performance remains limited by a weak grasp of fundamental code semantics (an example task is sketched below).
The performance gap presented in this paper indicates that further research is needed to improve LLMs' code reasoning capabilities.
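To make this limitation concrete, here is a hypothetical example of the kind of fine-grained question such a benchmark poses, wrapped in a chain-of-thought-style prompt. The snippet, the prompt wording, and the variable names are assumptions for illustration, not taken from the paper.

```python
# A hypothetical CodeSense-style task: given code and a concrete call,
# predict an intermediate value. The prompt format is an assumption.
snippet = '''
def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]
'''

prompt = (
    "Think step by step.\n"  # chain-of-thought-style instruction
    f"Given the function:\n{snippet}\n"
    "What is the value of `total` during the call normalize([2, 3, 5])?"
)
# Ground truth, recoverable from an execution trace: total == 10
print(prompt)
```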