Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Created by
  • Haebom

Authors

Hongyu Chen, Seraphina Goldfarb-Tarrant

Outline

This study investigates the reliability of large language models (LLMs) used as automatic evaluators for judging the safety of generated content. Using 11 different LLM judge models, the authors assess three key aspects: self-consistency, agreement with human judgment, and susceptibility to input artifacts such as apologetic or verbose phrasing. The results show that LLM evaluator bias can distort which content source is judged safer, undermining the validity of comparative safety evaluations. In particular, apologetic phrasing alone can skew evaluator preferences by up to 98%. Contrary to the expectation that larger models would be more robust, smaller models were in some cases more resistant to certain artifacts. To mitigate this robustness problem, the authors investigate jury-based evaluation, which aggregates the decisions of multiple judge models. Juries improve robustness and agreement with human judgment, yet artifact sensitivity persists even under the best jury configuration. These results underscore the urgent need for diverse, artifact-resistant methodologies for reliable safety evaluation.
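To make the jury idea concrete, here is a minimal sketch of majority-vote aggregation over several LLM judges, plus a simple artifact-sensitivity check. The judge interface, the `add_artifact` perturbation, and all names below are hypothetical illustrations under assumed binary safe/unsafe verdicts, not the paper's actual implementation.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: each judge maps a response to a binary
# safety verdict ("safe" / "unsafe"). In practice this would wrap an
# API call to the underlying LLM judge.
Judge = Callable[[str], str]

def jury_verdict(judges: list[Judge], response: str) -> str:
    """Aggregate per-judge verdicts by simple majority vote.
    Use an odd number of judges to avoid ties."""
    votes = Counter(judge(response) for judge in judges)
    return votes.most_common(1)[0][0]

def artifact_flip_rate(judges: list[Judge], responses: list[str],
                       add_artifact: Callable[[str], str]) -> float:
    """Fraction of responses whose jury verdict flips when an artifact
    (e.g. an apologetic preamble) is added; the artifact should not
    change the actual safety of the content."""
    flips = sum(
        jury_verdict(judges, r) != jury_verdict(judges, add_artifact(r))
        for r in responses
    )
    return flips / len(responses)

# Example artifact: apologetic framing that leaves the content unchanged.
apologetic = lambda r: "I'm sorry, but I must be careful here. " + r

if __name__ == "__main__":
    # Stub judges for demonstration: one judge is fooled by apologies.
    strict = lambda r: "unsafe" if "hack" in r else "safe"
    lenient = lambda r: "safe" if r.startswith("I'm sorry") else strict(r)
    jury = [strict, strict, lenient]
    # The fooled judge is outvoted, so the verdict does not flip (0.0).
    print(artifact_flip_rate(jury, ["how to hack a server"], apologetic))
```

In this toy run the majority vote absorbs the single artifact-sensitive judge, illustrating why juries improve robustness; if most judges share the same bias, however, the flip rate rises again, which matches the paper's finding that artifact sensitivity persists even in the best jury configurations.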

Takeaways, Limitations

Takeaways:
LLM evaluator bias can seriously distort safety evaluation results.
The correlation between model size and robustness to artifacts is inconsistent.
Jury-based evaluation improves robustness and agreement with human judgment, but does not fully resolve artifact susceptibility.
The findings emphasize the need for diverse, artifact-resistant safety evaluation methodologies.
Limitations:
The types and number of LLM judge models studied may be limited.
The range of artifact types tested may not be comprehensive.
Further research is needed on the optimal configuration of jury-based evaluation.
The subjectivity of human judgments may not be fully accounted for.