Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Created by
  • Haebom

Authors

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, Xiaodong Gu

Outline

This paper proposes MGDebugger, a hierarchical debugger that operates at multiple levels of granularity, designed to overcome the limitations of code generation with large language models (LLMs). MGDebugger isolates, identifies, and resolves bugs in generated code ranging from low-level syntax errors to high-level algorithmic flaws. It decomposes the problematic code into a hierarchical tree of subfunctions, with each level corresponding to errors of a particular granularity. Using an LLM-simulated Python executor, it traces the execution of each subfunction and monitors variable states to pinpoint errors precisely. Accuracy and efficiency are improved through subfunction-level test generation and bottom-up, iterative bug resolution. Experiments on the HumanEval and HumanEvalFix datasets show that MGDebugger outperforms existing debugging systems.
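
The workflow described above can be pictured with a minimal sketch. Everything below is illustrative only: the helpers `decompose`, `generate_tests`, and `llm_fix` are hypothetical stand-ins for the LLM calls, not the paper's actual implementation; what the sketch shows is the bottom-up, subfunction-level repair loop that the outline describes.

```python
# Minimal sketch of a bottom-up, hierarchical debugging loop in the spirit of
# MGDebugger. The LLM-facing helpers are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    name: str
    source: str
    children: List["SubFunction"] = field(default_factory=list)


def decompose(code: str) -> SubFunction:
    """Ask an LLM to split `code` into a tree of subfunctions (stubbed here)."""
    raise NotImplementedError("LLM-based decomposition goes here")


def generate_tests(fn: SubFunction) -> List[Callable[[str], bool]]:
    """Derive unit tests for a single subfunction (stubbed here)."""
    raise NotImplementedError("LLM-based test generation goes here")


def llm_fix(fn: SubFunction, failures: list) -> str:
    """Ask an LLM to repair one subfunction given its failing tests (stubbed here)."""
    raise NotImplementedError("LLM-based repair goes here")


def debug_bottom_up(node: SubFunction, max_rounds: int = 3) -> SubFunction:
    """Repair leaf subfunctions first, then their parents, so higher-level fixes
    can rely on already-corrected lower-level code."""
    node.children = [debug_bottom_up(child, max_rounds) for child in node.children]
    for _ in range(max_rounds):
        failures = [t for t in generate_tests(node) if not t(node.source)]
        if not failures:
            break
        node.source = llm_fix(node, failures)
    return node
```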

Takeaways, Limitations

Takeaways:
Presents a new debugging method that can improve the accuracy of LLM-based code generation.
Resolving bugs at multiple levels of granularity proves effective for complex problems.
An LLM-based simulation runner enables accurate error identification and correction (a minimal sketch follows this list).
Performance improvements over existing debugging systems are experimentally verified on the HumanEval and HumanEvalFix datasets.
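
As a companion to the point about the simulation runner, here is a minimal sketch of LLM-simulated execution, assuming a hypothetical `ask_llm` completion function: rather than running the code natively, the model is prompted to trace a subfunction step by step and report intermediate variable states, which can then be compared against the expected output. This illustrates the idea only; it is not the paper's prompt or interface.

```python
# Hedged sketch of LLM-simulated execution tracing; `ask_llm` is a hypothetical
# stand-in for whatever chat-completion client is being used.
TRACE_PROMPT = """You are a Python interpreter. Execute the function below on the
given input, step by step. After each statement, list the current values of all
local variables, then print the final return value on the last line.

Function:
{source}

Input:
{test_input}
"""


def simulate_execution(ask_llm, source: str, test_input: str) -> str:
    """Return the model's step-by-step trace for one subfunction call."""
    return ask_llm(TRACE_PROMPT.format(source=source, test_input=test_input))


def trace_disagrees(trace: str, expected_output: str) -> bool:
    """Flag the call as buggy when the simulated return value disagrees with the
    expected output; the full trace is then fed back to the repairing LLM."""
    lines = trace.strip().splitlines()
    return not lines or expected_output.strip() not in lines[-1]
```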
Limitations:
The system is currently specialized for Python; its applicability to other programming languages requires further research.
Further validation of the performance and reliability of the LLM-based simulation runner is needed.
Handling very complex or unusual classes of bugs requires further experimentation.
Due to inherent limitations of LLMs, certain types of bugs may go undetected.