This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Created by
Haebom
Authors
Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
Outline
This paper presents MTalk-Bench, a new benchmark for evaluating multi-turn speech-to-speech (S2S) large language models (LLMs). MTalk-Bench consists of nine realistic scenarios spanning three core dimensions (semantic information, vocal information, and ambient noise), together with targeted tasks designed to assess specific abilities such as reasoning. Evaluation combines arena-style (pairwise comparison) and rubric-based (absolute scoring) protocols, providing both relative and absolute assessments, with both human raters and LLMs serving as evaluators of model and human outputs. Experimental results show that S2S LLMs excel at processing semantic information but struggle to recognize vocal information and ambient noise. They also tend to lengthen their responses to restore consistency, at the cost of efficiency. Furthermore, modality-aware and task-specific designs outperform simple scaling. Finally, the paper analyzes the reliability and limitations of the proposed evaluation framework.
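To make the two evaluation protocols concrete, the sketch below shows one way arena-style pairwise judgments (relative ranking) and rubric-based absolute scores could be aggregated. This is a minimal illustration only: the function names, the Elo update rule with its default constants, the sample data, and the rubric dimension names are assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

# Arena-style protocol: aggregate pairwise judgments ("A", "B", or "tie")
# into per-model Elo ratings. The K-factor and initial rating are
# illustrative defaults, not values taken from the paper.
def elo_ratings(pairwise_results, k=32, initial=1000.0):
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in pairwise_results:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Rubric-based protocol: average absolute scores per model across
# rubric dimensions (dimension names here are hypothetical).
def rubric_scores(rubric_results):
    totals, counts = defaultdict(float), defaultdict(int)
    for model, scores in rubric_results:
        totals[model] += sum(scores.values()) / len(scores)
        counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}

if __name__ == "__main__":
    pairwise = [("model_x", "model_y", "A"),
                ("model_x", "model_y", "tie"),
                ("model_y", "model_x", "B")]
    rubrics = [("model_x", {"semantic": 4, "vocal": 3, "ambient": 2}),
               ("model_y", {"semantic": 3, "vocal": 2, "ambient": 2})]
    print(elo_ratings(pairwise))   # relative ranking from pairwise comparisons
    print(rubric_scores(rubrics))  # absolute scores from rubric ratings
```

Used together, the two views are complementary: the arena ranking shows which model wins head-to-head, while the rubric averages indicate how far each model is from the scoring ceiling on each dimension.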
Takeaways, Limitations
•
Takeaways:
◦
Introduces MTalk-Bench, a new benchmark for evaluating multi-turn S2S LLMs.
◦
S2S LLMs process semantic information well but struggle to recognize vocal information and ambient noise.
◦
Increasing response length helps restore consistency but reduces efficiency.
◦
Emphasizes the importance of modality-aware and task-specific design over simple scaling.
◦
Shows that the arena-style and rubric-based evaluation methods can complement each other.
•
Limitations:
◦
The arena and rubric methods yield consistent results only when performance differences are large.
◦
LLM evaluators agree with human raters only when differences are clear or the evaluation criteria are explicit.
◦
LLM evaluators exhibit position and length biases, and their assessments of nonverbal cues are reliable only when accompanied by textual annotations.
◦
Raises the need for a more robust, speech-aware evaluation framework.