Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
It is summarized using Google Gemini and operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

Created by
  • Haebom

Author

Dongxu Lu, Johan Jeuring, Albert Gatt

Outline

To address the challenge of evaluating long, knowledge-based role-play dialogues generated by large language models (LLMs), this study compares LLM-generated responses with human-authored responses in a multi-turn professional training simulation. Human evaluation (N=38) showed that the quality of LLM-generated responses deteriorated significantly over successive turns in terms of naturalness, context retention, and overall quality, whereas human-authored responses gradually improved. These human judgments were validated by automated LLM-as-a-judge evaluation, in which Gemini 2.0 Flash demonstrated strong agreement with human raters in both zero-shot pairwise preference judgments and probabilistic six-shot component evaluations. The study provides a multi-turn benchmark that exposes LLM degradation in knowledge-based role-play dialogues and a validated hybrid evaluation framework for reliably integrating LLMs into training simulations.
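
For illustration, here is a minimal sketch of how a zero-shot pairwise preference judgment of this kind could be implemented. The prompt wording, the evaluation criteria, and the call_judge helper (a stand-in for a Gemini 2.0 Flash API call) are assumptions made for this sketch, not the paper's actual protocol.

```python
# Minimal sketch of a zero-shot pairwise-preference LLM-as-a-judge step.
# Illustrative only: the prompt text, criteria, and `call_judge` helper are
# assumptions for this sketch, not the prompts or code used in the paper.

import random
from typing import Callable

JUDGE_PROMPT = """You are evaluating two candidate replies in a professional
training role-play dialogue.

Dialogue context:
{context}

Reply A:
{reply_a}

Reply B:
{reply_b}

Which reply is more natural, stays more consistent with the dialogue context,
and is better overall? Answer with a single letter: A or B."""


def pairwise_preference(
    context: str,
    llm_reply: str,
    human_reply: str,
    call_judge: Callable[[str], str],  # stand-in for a Gemini 2.0 Flash call
) -> str:
    """Return 'llm' or 'human' according to the judge's preference.

    The two candidates are presented in a random order to reduce position bias.
    """
    swapped = random.random() < 0.5
    reply_a, reply_b = (human_reply, llm_reply) if swapped else (llm_reply, human_reply)

    verdict = call_judge(
        JUDGE_PROMPT.format(context=context, reply_a=reply_a, reply_b=reply_b)
    ).strip().upper()

    prefers_a = verdict.startswith("A")
    if swapped:
        return "human" if prefers_a else "llm"
    return "llm" if prefers_a else "human"
```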

Takeaways, Limitations

  • The quality of LLM-generated responses degrades over successive turns in multi-turn conversations.
  • Human-authored responses, in contrast, gradually improve over the course of the dialogue.
  • Automated LLM-as-a-judge assessments using Gemini 2.0 Flash agree closely with human assessments (see the agreement sketch after this list).
  • Quality degradation over turns must be taken into account when deploying LLM-based training simulations.
  • The evaluation results are tied to a specific LLM (Gemini 2.0 Flash); results with other LLMs may differ.
  • Further research is needed on how well the simulation scenario and the evaluation criteria generalize to other settings.
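
As noted in the agreement item above, judge-human agreement can be quantified with a standard inter-rater statistic. The sketch below uses Cohen's kappa over hypothetical preference labels; the summary does not specify which agreement metric the paper reports, so both the metric choice and the example data are placeholders.

```python
# Minimal sketch of quantifying judge-human agreement on pairwise preferences.
# Cohen's kappa is used as one common agreement statistic; the example labels
# are hypothetical and do not come from the paper.

from sklearn.metrics import cohen_kappa_score

# Hypothetical per-item preferences ("llm" or "human"), aligned by dialogue turn.
human_prefs = ["human", "human", "llm", "human", "llm", "human"]
judge_prefs = ["human", "human", "llm", "llm", "llm", "human"]

kappa = cohen_kappa_score(human_prefs, judge_prefs)
raw_agreement = sum(h == j for h, j in zip(human_prefs, judge_prefs)) / len(human_prefs)

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```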