Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Created by
  • Haebom

Author

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

Outline

This paper introduces CodeJudgeBench, a benchmark for evaluating large language models (LLMs) used as code evaluators (LLM-as-a-Judge). CodeJudgeBench measures LLM-as-a-Judge performance across three coding tasks: code generation, code repair, and unit test generation. Comprehensively benchmarking 26 LLM-as-a-Judge models, the authors find that recent models with reasoning (thinking) capabilities significantly outperform non-reasoning models; even relatively small reasoning models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models as large as 70B. However, all models exhibit considerable randomness when judging coding tasks, and in pairwise comparison settings, merely changing the order in which the two responses are presented noticeably affects accuracy. The judges' performance also varies depending on which LLM produced the code or unit tests being judged. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Finally, the authors study prompting strategies for LLM-as-a-Judge, finding that pairwise comparison outperforms single-score (pointwise) judging, and that retaining the comments and reasoning in the full, unprocessed LLM response improves judgment performance.
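
To make the pairwise judging setup concrete, below is a minimal sketch of how such a prompt might be assembled. This is not the authors' code: the `chat(prompt)` helper and the prompt wording are hypothetical placeholders for whatever judge model API is used. Following the paper's finding, the candidate responses are passed in whole, with their comments and reasoning intact.

```python
# Minimal sketch of a pairwise LLM-as-a-Judge call for a coding task.
# `chat` is a hypothetical helper wrapping an arbitrary chat-completion API;
# it is NOT part of CodeJudgeBench and is shown only for illustration.

PAIRWISE_TEMPLATE = """You are a strict code reviewer.

Problem:
{problem}

Response A:
{response_a}

Response B:
{response_b}

Which response solves the problem correctly? Answer with exactly "A" or "B"."""


def judge_pair(chat, problem: str, response_a: str, response_b: str) -> str:
    """Ask the judge model to pick the better of two candidate responses.

    The candidates are included unprocessed (comments and reasoning kept),
    since the paper reports this improves judgment quality compared with
    stripping them down to bare code.
    """
    prompt = PAIRWISE_TEMPLATE.format(
        problem=problem, response_a=response_a, response_b=response_b
    )
    verdict = chat(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```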

Takeaways, Limitations

Takeaways:
CodeJudgeBench provides a standard benchmark for evaluating the performance of LLM-as-a-Judge models.
LLMs with reasoning (thinking) capabilities perform better on code judging tasks.
Even relatively small reasoning models can outperform much larger, specially trained judge models.
Pairwise comparison prompts that retain the model's comments and reasoning are more effective than single-score judging.
Limitations:
All LLM-as-a-Judge models still exhibit significant randomness.
The order in which responses are presented can significantly affect the verdict (see the consistency-check sketch after this list).
Judgment quality is inconsistent across code and unit tests generated by different LLMs.
These sensitivities raise concerns about the reliability and consistency of LLM-as-a-Judge approaches in coding scenarios.
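
Because swapping the presentation order alone can flip a verdict, one simple diagnostic is to query the judge twice, once in each order, and measure how often the two verdicts agree. The sketch below reuses the hypothetical `judge_pair` helper from above; the function names are illustrative and not part of the benchmark.

```python
def position_consistency(chat, problem: str, resp_a: str, resp_b: str) -> bool:
    """Return True if the judge's verdict is stable under order swapping.

    The pair is judged twice: once as (A, B) and once as (B, A). A consistent
    judge should prefer the same underlying response both times.
    """
    first = judge_pair(chat, problem, resp_a, resp_b)    # verdict on (A, B)
    second = judge_pair(chat, problem, resp_b, resp_a)   # verdict on (B, A)
    # Map the second verdict back to the original labeling before comparing.
    second_mapped = "A" if second == "B" else "B"
    return first == second_mapped


def consistency_rate(chat, pairs) -> float:
    """Fraction of (problem, resp_a, resp_b) triples judged consistently."""
    results = [position_consistency(chat, p, a, b) for p, a, b in pairs]
    return sum(results) / len(results) if results else 0.0
```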