Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Created by
  • Haebom

Author

Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

Outline

Although LLM-as-a-judge, in which a large language model (LLM) evaluates generated text, is a common evaluation method in natural language generation (NLG), it is used primarily as a quantitative tool whose main output is a numerical score. This paper proposes LLM-as-a-qualitative-judge, an LLM-based evaluation method that produces a structured report of the common problem types in an NLG system's outputs. The approach aims to give developers meaningful insights for improving a given NLG system and consists of two main steps: per-instance problem analysis and clustering of the discovered problems with an intuitive cumulative algorithm. The paper also presents a strategy for evaluating the approach, using roughly 300 problem annotations over instances from 12 NLG datasets. The results show that the per-instance problems reported by LLM-as-a-qualitative-judge match human-annotated problems in about two-thirds of cases, and that LLM-as-a-qualitative-judge can produce error-type reports similar to those written by human annotators. Case studies further show that using LLM-as-a-qualitative-judge can substantially improve the performance of NLG systems.
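
To make the two-step procedure above concrete, here is a minimal Python sketch of such a pipeline. It is not the authors' implementation: the `llm` callable (any prompt-to-text function), the prompt wording, and the yes/no merging criterion are illustrative assumptions, and the paper's actual prompts and clustering algorithm may differ.

```python
from typing import Callable, Dict, List

# Sketch only: `llm` stands in for any prompt -> completion function.

def analyze_instance(llm: Callable[[str], str], source: str, output: str) -> str:
    """Step 1: ask the judge LLM for a free-form description of the main problem."""
    prompt = (
        "You are reviewing the output of a text generation system.\n"
        f"Input:\n{source}\n\nSystem output:\n{output}\n\n"
        "Describe the single most important problem in the output, "
        "or reply 'no issue' if it looks fine."
    )
    return llm(prompt).strip()


def cluster_problems(llm: Callable[[str], str], problems: List[str]) -> Dict[str, List[str]]:
    """Step 2: cumulative clustering -- each new problem is merged into an
    existing cluster or starts a new one, as judged by the LLM."""
    clusters: Dict[str, List[str]] = {}
    for problem in problems:
        if problem.lower() == "no issue":
            continue
        placed = False
        for label in clusters:
            verdict = llm(
                f"Problem A: {label}\nProblem B: {problem}\n"
                "Do these describe the same type of error? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                clusters[label].append(problem)
                placed = True
                break
        if not placed:
            clusters[problem] = [problem]  # first problem of a cluster names it
    return clusters


def error_report(clusters: Dict[str, List[str]]) -> str:
    """Format the clusters as a report, most frequent error type first."""
    rows = sorted(clusters.items(), key=lambda kv: -len(kv[1]))
    return "\n".join(f"{len(members):>4}  {label}" for label, members in rows)


# Hypothetical usage with evaluation pairs of (source, system output):
# problems = [analyze_instance(llm, src, out) for src, out in eval_pairs]
# print(error_report(cluster_problems(llm, problems)))
```

The resulting report lists each error type with its frequency, which is the kind of structured, qualitative feedback the method is meant to give developers instead of a single numerical score.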

Takeaways, Limitations

  • Presents a novel evaluation method that uses an LLM to generate structured reports on the problem types in NLG system outputs.
  • Shows that LLM-as-a-qualitative-judge produces results similar to those of human annotators and can contribute to improving NLG system performance.
  • Establishes an evaluation strategy and dataset for the proposed approach.
  • Code and data are publicly released.
  • Per-instance problems match human annotations in only about two-thirds of cases, which suggests room for improvement.
  • The performance of LLM-as-a-qualitative-judge may depend on the quality of the underlying LLM.