This paper introduces CodeJudgeBench, a benchmark for evaluating large language models (LLMs) as code judges (LLM-as-a-Judge). CodeJudgeBench assesses LLM-as-a-Judge performance across three coding tasks: code generation, code modification, and unit test generation. Comprehensively benchmarking 26 LLM-as-a-Judge models, we find that recent models with reasoning capabilities significantly outperform non-reasoning models. Even relatively small reasoning models, such as Qwen3-8B, outperform specially trained LLM-as-a-Judge models of up to 70B parameters by as much as 70%. However, all models still exhibit considerable randomness when judging coding tasks: in pairwise comparison, merely changing the order in which the two responses are presented substantially affects accuracy. In addition, the performance of LLM-as-a-Judge models varies depending on which LLM produced the code and unit tests being judged. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Finally, we study optimal prompting strategies for LLM-as-a-Judge, finding that pairwise comparison outperforms single-score (pointwise) judging, and that retaining the full, unprocessed LLM response, including its comments and reasoning, improves judging performance.
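To make the pairwise setup concrete, the sketch below shows one way such a judge prompt could be assembled. The prompt wording, helper names, and the idea of querying the judge in both presentation orders are illustrative assumptions, not the paper's exact protocol; note that each candidate response is passed through verbatim rather than stripped down to bare code.

```python
# Minimal sketch of a pairwise LLM-as-a-Judge prompt for code generation.
# The template and helper names are illustrative assumptions, not the exact
# protocol used by CodeJudgeBench.

JUDGE_TEMPLATE = """You are a careful code reviewer.

Problem:
{problem}

Response A:
{response_a}

Response B:
{response_b}

Which response solves the problem correctly? Answer with "A" or "B" only."""


def build_pairwise_prompt(problem: str, response_a: str, response_b: str) -> str:
    """Build a pairwise judge prompt, keeping each candidate response verbatim
    (comments and reasoning included) rather than extracting only the code."""
    return JUDGE_TEMPLATE.format(
        problem=problem, response_a=response_a, response_b=response_b
    )


def build_both_orders(problem: str, response_a: str, response_b: str) -> list[str]:
    """Return prompts in both presentation orders, since judge accuracy can
    shift when the order of the two candidate responses is swapped."""
    return [
        build_pairwise_prompt(problem, response_a, response_b),
        build_pairwise_prompt(problem, response_b, response_a),
    ]


if __name__ == "__main__":
    prompts = build_both_orders(
        "Write a function that returns the n-th Fibonacci number.",
        "# Iterative solution with O(n) time.\ndef fib(n): ...",
        "# Recursive solution without memoization.\ndef fib(n): ...",
    )
    print(prompts[0])
```

Averaging or comparing the judge's verdicts across both orders is one simple way to surface the order sensitivity discussed above.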