Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

Created by
  • Haebom

Author

Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

Outline

This paper evaluates the quantum code generation capabilities of large language models (LLMs). It introduces QHackBench, a new benchmark dataset built from real-world problems from the Quantum Hackathon (QHack), and uses it to measure LLM performance on PennyLane-based quantum code generation. The authors compare vanilla prompting with retrieval-augmented generation (RAG), using a structured evaluation framework that assesses functional correctness, syntactic validity, and execution success on problems of varying difficulty. RAG-based models supplemented with an augmented PennyLane dataset achieve results comparable to standard prompting, even on complex quantum algorithms. The paper also proposes a multi-agent evaluation pipeline that iteratively corrects failing solutions, further improving the execution success rate. The QHackBench dataset, evaluation framework, and experimental results are released publicly to stimulate research on AI-based quantum programming.
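To make the three reported metrics concrete, here is a minimal sketch of an execution-based evaluation loop in the same spirit: parse the generated code for syntactic validity, run it for execution success, and compare its output with a reference answer for functional correctness, with a simple retry loop standing in for the multi-agent correction stage. The generate_solution stub, the EvalResult fields, and the retry logic are illustrative assumptions, not the authors' released framework.

```python
"""Sketch of an execution-based evaluation loop for LLM-generated
PennyLane code. Hypothetical stand-in, not the QHackBench framework."""
import ast
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class EvalResult:
    syntactically_valid: bool
    executed: bool
    functionally_correct: bool


def generate_solution(problem: str, feedback: str = "") -> str:
    # Placeholder for an LLM call (vanilla prompting or RAG over
    # PennyLane docs). Here it just returns a trivial PennyLane program.
    return (
        "import pennylane as qml\n"
        "dev = qml.device('default.qubit', wires=1)\n"
        "@qml.qnode(dev)\n"
        "def circuit():\n"
        "    qml.Hadamard(wires=0)\n"
        "    return qml.expval(qml.PauliZ(0))\n"
        "print(circuit())\n"
    )


def evaluate(code: str, expected_output: str) -> EvalResult:
    # 1) Syntactic validity: does the code parse?
    try:
        ast.parse(code)
    except SyntaxError:
        return EvalResult(False, False, False)

    # 2) Execution success: run the code in a subprocess with a timeout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=60)
        executed, stdout = proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        executed, stdout = False, ""

    # 3) Functional correctness: compare stdout with the reference answer.
    correct = executed and stdout.strip() == expected_output
    return EvalResult(True, executed, correct)


def solve_with_retries(problem: str, expected_output: str, max_rounds: int = 3):
    # Iterative correction: feed the failure signal back into the next
    # generation attempt, loosely mimicking the multi-agent pipeline.
    feedback = ""
    for _ in range(max_rounds):
        code = generate_solution(problem, feedback)
        result = evaluate(code, expected_output)
        if result.functionally_correct:
            return code, result
        feedback = f"Previous attempt failed: {result}"
    return code, result
```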

Takeaways, Limitations

Takeaways:
The new QHackBench benchmark provides a foundation for systematically evaluating LLMs' quantum code generation capabilities.
Retrieval-augmented generation (RAG) can improve LLMs' quantum code generation performance; a toy sketch of the retrieval step appears after the Limitations list below.
A multi-agent evaluation pipeline that iteratively corrects failing solutions offers a route to higher code generation accuracy.
The publicly released dataset and framework are expected to stimulate research on AI-based quantum programming.
Limitations:
Because the benchmark is built solely from QHack problems, generalization to other quantum programming environments or problem types may be limited.
The evaluation metrics cover only functional correctness, syntactic validity, and execution success rate, so other important aspects such as code efficiency or circuit optimization are not captured.
The paper offers little concrete analysis of how much the multi-agent evaluation pipeline actually improves performance.
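As a companion to the harness sketched above, the following toy example illustrates the retrieval step behind the RAG takeaway: pull a few relevant PennyLane documentation snippets and prepend them to the prompt. The snippet corpus, the lexical scoring, and the prompt template are all hypothetical; the paper uses an augmented PennyLane dataset but does not prescribe this implementation.

```python
"""Toy retrieval-augmented prompting sketch. The corpus, scoring, and
prompt format are illustrative assumptions, not the paper's setup."""
from difflib import SequenceMatcher

# Hypothetical mini-corpus of PennyLane documentation snippets.
DOC_SNIPPETS = [
    "qml.qnode(device) turns a quantum function into an executable QNode.",
    "qml.expval(op) returns the expectation value of an observable.",
    "qml.device('default.qubit', wires=n) creates a simulator device.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy lexical similarity; a real system would use embedding search.
    scored = sorted(DOC_SNIPPETS,
                    key=lambda s: SequenceMatcher(None, query, s).ratio(),
                    reverse=True)
    return scored[:k]


def build_rag_prompt(problem: str) -> str:
    # Prepend retrieved documentation as context before the task.
    context = "\n".join(retrieve(problem))
    return f"Reference PennyLane documentation:\n{context}\n\nTask:\n{problem}"
```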