Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Learning to Generate Unit Tests for Automated Debugging

Created by
  • Haebom

Authors

Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal

Outline

This paper identifies a trade-off in LLM-generated unit tests: it is hard to generate unit test (UT) inputs that reveal errors in faulty code while also correctly predicting the expected outputs without access to a gold solution. To address this, the authors propose UTGen, which trains LLMs to generate error-revealing unit test inputs together with correct expected outputs from a task description. Because model-generated tests can be noisy, they further propose UTDebug, which uses test-time compute to improve UT output prediction, and which validates and backtracks edits against multiple generated UTs to avoid overfitting, enabling effective LLM debugging. Experimentally, UTGen outperforms other LLM-based baselines by 7.59% on a metric that measures both error-revealing UT inputs and correct UT outputs. Combined with UTDebug, it improves the pass@1 accuracy of Qwen2.5 32B by 3.17% and 12.35%, respectively, on challenging debugging splits of HumanEvalFix and MBPP+ compared with other LLM-based UT generation baselines. Moreover, feedback from the Qwen2.5 32B-based UTGen model improves the debugging performance of frontier LLMs such as GPT-4o by 13.8%. Finally, UTGen proves to be a better judge of code correctness: with best-of-10 sampling on HumanEval+ using Qwen2.5 7B, it outperforms a state-of-the-art 8B reward model by 4.43%.
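The validate-and-backtrack idea can be illustrated with a short sketch. The following is a minimal, hypothetical Python loop, not the authors' released implementation: it assumes model-generated tests are (args, expected output) pairs, that each candidate program defines a function named `solution`, and that `generate_unit_tests` and `propose_fix` are placeholder callables standing in for LLM calls.

```python
# Hypothetical sketch of a validate-and-backtrack debugging loop over
# model-generated unit tests. `generate_unit_tests` and `propose_fix` are
# placeholders for LLM calls and are NOT part of any released UTGen/UTDebug API.

def run_tests(code: str, tests: list[tuple[tuple, object]]) -> int:
    """Count how many (args, expected_output) pairs the candidate code passes."""
    namespace: dict = {}
    try:
        exec(code, namespace)             # load the candidate program
        solution = namespace["solution"]  # assumed entry point
    except Exception:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass                          # a crash on a test counts as a failure
    return passed


def debug_with_generated_tests(task: str, buggy_code: str,
                               generate_unit_tests, propose_fix,
                               max_rounds: int = 3) -> str:
    """Accept an LLM edit only if it passes more generated unit tests than the
    current best program; otherwise discard it (backtrack) and retry."""
    tests = generate_unit_tests(task, buggy_code)        # UTGen-style test generation
    best_code = buggy_code
    best_passed = run_tests(best_code, tests)
    for _ in range(max_rounds):
        if best_passed == len(tests):
            break                                        # all generated UTs pass
        candidate = propose_fix(task, best_code, tests)  # LLM edit with UT feedback
        candidate_passed = run_tests(candidate, tests)
        if candidate_passed > best_passed:
            best_code, best_passed = candidate, candidate_passed
        # else: backtrack to best_code and try another edit in the next round
    return best_code
```

Accepting an edit only when it improves the pass rate on several generated tests is one way to realize the paper's described strategy of validating and backtracking edits to avoid overfitting to noisy tests.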

Takeaways, Limitations

Takeaways:
Presents a novel approach to resolving the trade-off between generating unit test inputs that reveal errors and correctly predicting the expected outputs.
Improved LLM-based unit test generation and debugging performance with UTGen and UTDebug.
Contributes to improving LLMs' ability to judge code correctness.
Contributes to improving the debugging performance of state-of-the-art LLMs.
Limitations:
The performance improvements of UTGen and UTDebug may depend on the specific LLM (Qwen2.5) and dataset. Further research is needed to determine generalization performance on other LLMs and datasets.
Unit test generation and debugging performance on more complex code still need to be evaluated.
Further analysis is needed on the effectiveness of UTDebug's overfitting prevention strategy.
Applicability and scalability to large codebases need to be evaluated.