To address the challenge of evaluating long, knowledge-based role-playing dialogues generated by large language models (LLMs), this study compared LLM-generated responses with human-authored responses in a multi-turn professional training simulation. Human evaluations (N=38) revealed that the quality of LLM-generated responses deteriorated significantly with each successive turn in naturalness, context retention, and overall quality, whereas human-authored responses gradually improved. These human evaluation results were corroborated by automated LLM-as-a-judge evaluations, in which Gemini 2.0 Flash showed strong agreement with human raters in both zero-shot pairwise preference judgments and probabilistic six-shot component evaluations. This study provides a multi-turn benchmark that exposes LLM degradation in knowledge-based role-playing dialogues and presents a validated hybrid evaluation framework for the reliable integration of LLMs into training simulations.
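The zero-shot pairwise-preference step of such an LLM-as-a-judge setup can be illustrated with a minimal sketch. All names here (`judge_pairwise`, `JudgeFn`, the stub judge, and the prompt wording) are illustrative assumptions, not the paper's exact protocol; a real run would wrap a chat-completion client for the judge model (e.g., Gemini 2.0 Flash) behind the `judge` callable.

```python
# Minimal sketch, assuming a generic chat-completion judge; not the paper's exact prompts.
from typing import Callable, Literal

JudgeFn = Callable[[str], str]  # wraps any chat-completion client (e.g., Gemini 2.0 Flash)

def judge_pairwise(context: str, response_a: str, response_b: str,
                   judge: JudgeFn) -> Literal["A", "B", "tie"]:
    """Ask the judge which candidate better continues the dialogue (zero-shot)."""
    prompt = (
        "You are evaluating a multi-turn role-playing training dialogue.\n"
        f"Dialogue so far:\n{context}\n\n"
        f"Candidate A:\n{response_a}\n\n"
        f"Candidate B:\n{response_b}\n\n"
        "Considering naturalness, context retention, and overall quality, "
        "answer with a single letter: A, B, or tie."
    )
    verdict = judge(prompt).strip().upper()
    if verdict.startswith("A"):
        return "A"
    if verdict.startswith("B"):
        return "B"
    return "tie"

if __name__ == "__main__":
    # Stub judge so the sketch runs without an API key; swap in a real client call.
    stub = lambda prompt: "A"
    print(judge_pairwise("Trainer: How would you reassure the client?",
                         "I'd start by acknowledging their concern...",
                         "Please calm down.",
                         stub))
```

In practice, pairwise judging of this kind is typically run with the A/B order of the two candidates randomized or swapped across repeated calls to mitigate the judge's position bias before aggregating preferences per turn.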