Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation

Created by
  • Haebom

Authors

Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

Outline

This paper criticizes existing code generation benchmarks for overlooking contract adherence, that is, whether generated code respects preconditions and input validity constraints, which is an important aspect of real-world software. To address this, the authors present PACT, a framework for program evaluation and contract-adherence evaluation. PACT is presented as the first framework to systematically assess and improve contract adherence alongside functional correctness. It provides a corpus of test suites focused on contract violations, analyzes code generation under various prompting conditions, and introduces new metrics that quantify contract adherence in both test generation and code generation. In doing so, it exposes errors overlooked by existing benchmarks and evaluates the robustness of LLM-generated code.
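To make the distinction between functional correctness and contract adherence concrete, here is a minimal sketch. It is an illustration only and is not taken from the paper or from PACT: the function names, the example contract (a non-empty list of numbers), and the convention that violations are signaled with `ValueError` are all assumptions.

```python
def mean_functional_only(values):
    # Returns the right answer on valid input but never checks the precondition.
    return sum(values) / len(values)


def mean_contract_aware(values):
    # Hypothetical contract: `values` must be a non-empty list of numbers.
    if not isinstance(values, list) or len(values) == 0:
        raise ValueError("values must be a non-empty list")
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        raise ValueError("values must contain only numbers")
    return sum(values) / len(values)


def rejects_with_contract_error(func, bad_input):
    # True only if `func` rejects the input with the agreed error type
    # (ValueError here, an assumption made for this sketch).
    try:
        func(bad_input)
    except ValueError:
        return True
    except Exception:
        # Crashing with some other error is not treated as honoring the contract.
        return False
    return False
```

Both functions agree on valid input such as `[1, 2, 3]`, so a purely functional test suite cannot tell them apart. Only a contract-violating input such as `[]` separates them: `mean_contract_aware` rejects it explicitly, while `mean_functional_only` fails with an incidental `ZeroDivisionError`.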

Takeaways, Limitations

Takeaways:
Presents a framework that can improve the reliability of LLM-generated code by addressing contract adherence issues that existing benchmarks miss.
Shows that prompts containing contract-violation test cases are effective at improving a model's contract adherence.
Provides new metrics for evaluating the robustness of generated code.
Presents a systematic methodology for discovering and analyzing errors missed by existing benchmarks.
Improves the reproducibility of the research by releasing the code and data openly.
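As a rough illustration of what a prompt that includes contract-violation test cases might look like, the sketch below assembles one from a task description. The helper name `build_prompt`, the case format, and the prompt wording are hypothetical and are not the prompts used in the paper.

```python
def build_prompt(task_description, violation_cases):
    # violation_cases: list of (input_repr, error_name) pairs.
    # This format is an assumption made for the sketch.
    lines = [
        task_description.strip(),
        "",
        "Inputs that violate the contract must be rejected:",
    ]
    for input_repr, error_name in violation_cases:
        lines.append(f"- f({input_repr}) should raise {error_name}")
    return "\n".join(lines)


prompt = build_prompt(
    "Write f(values) that returns the mean of a non-empty list of numbers.",
    [("[]", "ValueError"), ("['a']", "ValueError")],
)
```

The resulting prompt states the task and then lists the invalid inputs together with the expected rejection, so the model sees the contract as explicit examples instead of having to infer it from the description.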
Limitations:
The test suites have limited scope, since they extend only HumanEval+ and MBPP+.
Further validation is needed to establish the practicality and generalizability of the proposed metrics.
The results may depend on the specific models and prompting methods evaluated, so they need to be validated across a wider range of models and environments.
Code and data are shared via GitHub links, but additional explanations or guidelines for using the framework are lacking.