Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the service is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Created by
  • Haebom

Authors

Isik Baran Sandan, Tu Anh Dinh, Jan Niehues

Outline

In this paper, we propose "Knockout Assessment", a novel method that uses large language models (LLMs) as evaluators. Existing LLM-based evaluation relies on individual scoring or single-round pairwise comparisons, which lack a view of the overall ranking; Knockout Assessment instead runs the evaluation as a tournament of iterative pairwise comparisons. Experiments with three LLMs on two datasets show that Knockout Assessment improves evaluation accuracy and brings LLM judgments closer to human ones, raising the average Pearson correlation with expert scores by 0.07 across university-level exam grading and machine translation evaluation.
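To make the tournament mechanics concrete, below is a minimal Python sketch of knockout-style pairwise evaluation. The judge function, the random pairing, and the use of survival depth as a score are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import random

def llm_prefers(candidate_a: str, candidate_b: str) -> str:
    """Return the candidate an LLM judge considers better.

    Placeholder: in practice this would prompt a judge model with both
    candidates and parse its verdict. A trivial length heuristic stands
    in here so the sketch runs end to end.
    """
    return candidate_a if len(candidate_a) >= len(candidate_b) else candidate_b

def knockout_round(candidates: list[str]) -> list[str]:
    """Play one round of pairwise matches; winners (and a bye) advance."""
    pool = list(candidates)
    random.shuffle(pool)  # randomize pairings to reduce bracket bias
    winners = [llm_prefers(pool[i], pool[i + 1])
               for i in range(0, len(pool) - 1, 2)]
    if len(pool) % 2 == 1:  # odd candidate out gets a bye this round
        winners.append(pool[-1])
    return winners

def knockout_rank(candidates: list[str]) -> dict[str, int]:
    """Run rounds until one candidate remains.

    Scores each candidate by the number of rounds it survived, which can
    serve as a coarse ranking signal derived from the tournament.
    """
    rounds_survived = {c: 0 for c in candidates}
    pool = list(candidates)
    while len(pool) > 1:
        pool = knockout_round(pool)
        for c in pool:
            rounds_survived[c] += 1
    return rounds_survived

answers = ["short", "a medium answer", "a rather long, detailed answer", "tiny"]
print(knockout_rank(answers))
```

With a real LLM behind `llm_prefers`, each match gives the judge only two candidates to weigh at a time, which is the property the paper exploits: the global ranking emerges from many local comparisons rather than from a single absolute score.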

Takeaways, Limitations

Takeaways:
  • Presents a novel method for improving the accuracy of LLM-based assessment.
  • Through iterative pairwise comparisons in a tournament format, the LLM gains a better view of the overall ranking.
  • Increases the usability of LLM evaluators in a variety of fields, including machine translation and exam grading.
  • Improves the reliability of LLM assessments by raising agreement between human and LLM raters.
Limitations:
  • The size and diversity of the datasets in the experiments may be limited.
  • Results may depend on the specific LLMs used; generalizability to other LLMs needs to be verified.
  • Tournament-style evaluation increases computational cost, since ranking requires multiple rounds of pairwise comparisons (see the cost sketch after this list).
  • Further research is needed to determine whether the gains from Knockout Assessment hold across all types of assessment tasks.
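For intuition on the cost: a single-elimination bracket over n candidates eliminates one candidate per match, so it needs n − 1 comparisons to produce a winner, versus n(n − 1)/2 for a full round-robin. The quick check below is illustrative arithmetic, not a figure from the paper.

```python
def knockout_comparisons(n: int) -> int:
    # One candidate is eliminated per match, so n - 1 matches total.
    return n - 1

def round_robin_comparisons(n: int) -> int:
    # Every unordered pair meets once: n choose 2.
    return n * (n - 1) // 2

for n in (8, 64, 512):
    print(n, knockout_comparisons(n), round_robin_comparisons(n))
```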