Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

Created by
  • Haebom

Authors

Dimitrios Stamatios Bouras, Yihan Dai, Tairan Wang, Yingfei Xiong, Sergey Mechtaev

Outline

Verifying that programs comply with software requirements expressed in natural language is a challenging problem. Large language models (LLMs) are promising candidates for this task, but they often fail to detect even simple bugs. In this paper, we present HoarePrompt, a novel approach that transfers fundamental ideas from program verification to natural language reasoning about code. Drawing on strongest postcondition calculus, HoarePrompt uses an LLM to generate, step by step, natural language descriptions of the program states reachable at various points in the code. To handle loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Once the program states are annotated, HoarePrompt uses the LLM to assess whether the annotated program complies with the natural language requirements. To evaluate the quality of classifiers of program correctness with respect to natural language requirements, we constructed CoCoClaNeL, a challenging dataset of solutions to programming competition problems. Experiments show that HoarePrompt improves the MCC by 61% over directly using Zero-shot-CoT prompts for correctness classification, and by 106% over a classifier that assesses correctness via LLM-based test generation. The k-induction mechanism alone improves the MCC by 26%, highlighting the effectiveness of its loop handling.
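The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM call is stubbed out, and all function names (`llm`, `describe_state`, `annotate`, `unroll_for_k_induction`) are hypothetical. It only shows the shape of the idea: propagate a natural language "strongest postcondition" through each statement, and unroll loop bodies k times before asking for an inductive generalization.

```python
# Hypothetical sketch of a HoarePrompt-style pipeline (not the paper's code).
# The LLM is stubbed; in practice each call would query a real model.

def llm(prompt: str) -> str:
    """Stub standing in for an LLM API call."""
    return f"<description of the state after: {prompt.splitlines()[-1]}>"

def describe_state(precondition: str, statement: str) -> str:
    """Ask the LLM for a natural language strongest postcondition:
    the state reachable after `statement`, given `precondition`."""
    prompt = (
        f"Current program state: {precondition}\n"
        f"Describe the state after executing:\n"
        f"{statement}"
    )
    return llm(prompt)

def annotate(statements: list[str], precondition: str = "no variables are set") -> list[str]:
    """Propagate state descriptions through straight-line code, one
    statement at a time, threading each postcondition forward."""
    states = [precondition]
    for stmt in statements:
        states.append(describe_state(states[-1], stmt))
    return states

def unroll_for_k_induction(loop_body: str, k: int = 2) -> list[str]:
    """Few-shot-driven k-induction, loosely adapted from model checking:
    describe the state after k unrollings of the loop body, then (in the
    real system) ask the LLM to generalize to an arbitrary iteration."""
    return [f"iteration {i + 1}: {loop_body}" for i in range(k)]

# Annotate a two-statement program; a final LLM call would then judge
# whether the annotated program meets the natural language requirements.
states = annotate(["x = 0", "x += 1"])
```

The key design point is that each state description is fed back as the precondition for the next statement, mirroring how a strongest-postcondition calculus threads assertions through a program.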

Takeaways, Limitations

Takeaways:
  • We present HoarePrompt, an effective new method for verifying program correctness against software requirements expressed in natural language.
  • It outperforms both Zero-shot-CoT prompting and LLM-based test generation.
  • Few-shot-driven k-induction improves the handling of loops.
  • We introduce CoCoClaNeL, a new dataset for evaluating program correctness classifiers.
Limitations:
  • HoarePrompt's performance may depend on the capabilities of the underlying LLM.
  • Generalization to more complex programs and requirements requires further study.
  • The size and diversity of the CoCoClaNeL dataset may be limited.
  • How to choose the value of k for k-induction may need further investigation.