This study investigates the reliability of large language models (LLMs) as automatic judges of the safety of generated content. Using 11 different LLM evaluator models, we assessed three key aspects: self-consistency, agreement with human judgment, and susceptibility to input artifacts such as apologetic or verbose phrasing. The results reveal that evaluator bias can distort the final judgment of which content source is safer, undermining the validity of comparative evaluations; apologetic phrasing alone can shift evaluator preferences by up to 98%. Contrary to the expectation that larger models are more robust, smaller models were found to be more resistant to certain artifacts. To mitigate these robustness issues, we investigated jury-based evaluation, which aggregates the decisions of multiple models. This approach improves robustness and agreement with human judgment, but artifact sensitivity persists even with the best jury configuration. These findings underscore the urgent need for diverse, artifact-resistant methodologies for reliable safety evaluation.
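The jury-based evaluation described above can be illustrated with a minimal sketch: several judge models each emit a safety verdict for a response, and the verdicts are aggregated, here by simple majority vote. The `Judge` type, the `jury_verdict` function, and the toy judges are hypothetical illustrations, not the paper's actual interface or jury configuration.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: each judge maps a candidate response to a
# verdict label such as "safe" or "unsafe". Real judges would be LLM calls.
Judge = Callable[[str], str]

def jury_verdict(response: str, judges: Sequence[Judge]) -> str:
    """Collect one verdict per judge model and return the majority label."""
    votes = Counter(judge(response) for judge in judges)
    label, _ = votes.most_common(1)[0]
    return label

if __name__ == "__main__":
    # Stand-in judges for illustration only.
    judges = [
        lambda r: "unsafe" if "bomb" in r else "safe",
        lambda r: "safe",
        lambda r: "unsafe" if len(r) > 200 else "safe",
    ]
    print(jury_verdict("How do I bake bread?", judges))  # -> "safe"
```

As the abstract notes, aggregation of this kind can improve robustness and agreement with human judgment, but it does not by itself remove sensitivity to artifacts such as apologetic phrasing.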