Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Created by
  • Haebom

Author

Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao

Outline

This paper addresses the difficulty of evaluating the conversational performance of end-to-end spoken dialogue models such as GPT-4o-audio, and proposes WavReward, a novel evaluation model, to tackle it. Built on an audio language model, WavReward can assess both the IQ and EQ of spoken dialogue systems, and it uses reinforcement learning with multi-sample feedback to train a specialized evaluator. It is trained on ChatReward-30K, a preference dataset of 30,000 samples covering diverse scenarios, including text-based chat, instruction-following chat involving acoustic attributes, and implicit chat. In experiments, WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, raising objective accuracy from Qwen2.5-Omni's 53.4% to 91.5%, and it also leads in subjective A/B testing with a score of 83%. Ablation studies confirm that each component of WavReward is necessary.
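The outline describes an evaluator trained on pairwise preference data and measured by how often it ranks the preferred reply higher (the objective-accuracy figure above is such a metric). A minimal sketch of the standard Bradley-Terry preference objective and pairwise accuracy, using hypothetical scalar scores — this is an illustration of the general technique, not the paper's actual implementation:

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Lower when the evaluator scores the preferred reply higher.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def preference_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) score pairs ranked correctly."""
    correct = sum(1 for r_chosen, r_rejected in pairs if r_chosen > r_rejected)
    return correct / len(pairs)

# Hypothetical evaluator scores for four preference pairs.
pairs = [(0.9, 0.2), (0.4, 0.7), (1.2, 0.1), (0.8, 0.3)]
print(preference_accuracy(pairs))        # 0.75 (3 of 4 pairs ranked correctly)
print(round(bt_loss(0.9, 0.2), 4))       # ~0.4032
```

The same accuracy metric applies regardless of how the scores are produced, which is what makes it a common yardstick for comparing reward evaluators.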

Takeaways, Limitations

Takeaways:
Presents a new methodology for effectively evaluating the conversational ability of spoken dialogue models.
Establishes a new standard for evaluating spoken dialogue models, with improved accuracy and reliability over existing evaluators.
Enables comprehensive evaluation covering both IQ and EQ through an audio language model-based evaluator.
Contributes to future research through the release of the large-scale preference dataset ChatReward-30K.
Limitations:
The code and data are not yet publicly available on GitHub (planned for release after paper acceptance).
The composition and quality of the ChatReward-30K dataset are not described in detail.
Generalization to a wider range of spoken dialogue models and scenarios remains to be verified.
Further work is needed to improve the reliability of subjective assessments.