This paper evaluates the quantum computing code generation capabilities of large language models (LLMs). We present QHackBench, a new benchmark dataset built from real-world problems of the Quantum Hackathon (QHack), for benchmarking LLM performance on PennyLane-based quantum code generation. We compare vanilla prompting and retrieval-augmented generation (RAG) using a structured evaluation framework that assesses functional correctness, syntactic validity, and execution success across problems of varying difficulty. We demonstrate that RAG, when supplied with an augmented PennyLane dataset, achieves performance comparable to standard prompting, even on complex quantum algorithms. Furthermore, we propose a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further improving execution success rates. By releasing the QHackBench dataset, evaluation framework, and experimental results, we aim to stimulate further research in AI-assisted quantum programming.