Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

Created by
  • Haebom

Authors

Qiongqiong Wang, Hardik Bhupendra Sailor, Tianchi Liu, Wenyu Zhang, Muhammad Huzaifah, Nattadaporn Lertcheva, Shuo Sun, Nancy F. Chen, Jinyang Wu, AiTi Aw

Outline

This paper addresses the inability of speech-understanding large language models (Speech-LLMs) to interpret non-verbal aspects of speech, which are essential for social and emotional intelligence. To address this, the authors propose CP-Bench, a benchmark for contextual paralinguistic reasoning that integrates verbal content with non-verbal cues such as emotion and prosody. CP-Bench consists of two question-answering (QA) datasets that require both linguistic and empathic understanding. The authors evaluate state-of-the-art Speech-LLMs, including open-source and closed-source models, and provide a comprehensive analysis across question types. For the top two models, they additionally analyze the impact of decoding temperature. The results reveal limitations of existing evaluations and offer insights for building more context-aware and emotionally intelligent Speech-LLMs.
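To make the temperature analysis concrete, below is a minimal, hypothetical sketch of how one might sweep decoding temperature while scoring a Speech-LLM on a contextual/paralinguistic QA set. The names `QAItem`, `ask_model`, the exact-match accuracy metric, and the dataset layout are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch: sweep decoding temperature when evaluating a Speech-LLM
# on a contextual/paralinguistic QA set. `ask_model` and the dataset layout
# are illustrative assumptions, not CP-Bench's actual implementation.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class QAItem:
    audio_path: str   # path to the in-the-wild speech clip
    question: str     # contextual or paralinguistic question about the clip
    reference: str    # gold answer used for scoring

def evaluate_at_temperature(
    items: Iterable[QAItem],
    ask_model: Callable[[str, str, float], str],
    temperature: float,
) -> float:
    """Return exact-match accuracy of the model at one decoding temperature."""
    items = list(items)
    correct = 0
    for item in items:
        prediction = ask_model(item.audio_path, item.question, temperature)
        if prediction.strip().lower() == item.reference.strip().lower():
            correct += 1
    return correct / len(items) if items else 0.0

def temperature_sweep(items, ask_model, temperatures=(0.0, 0.3, 0.7, 1.0)):
    """Map each decoding temperature to the resulting accuracy."""
    return {t: evaluate_at_temperature(items, ask_model, t) for t in temperatures}

if __name__ == "__main__":
    # Toy stand-in for a real Speech-LLM call, so the sketch runs end to end.
    def dummy_model(audio_path: str, question: str, temperature: float) -> str:
        return "happy"

    toy_set = [QAItem("clip_001.wav", "What emotion does the speaker convey?", "happy")]
    print(temperature_sweep(toy_set, dummy_model))
```

In practice, exact-match accuracy would likely be replaced by whatever scoring protocol the benchmark defines (e.g., multiple-choice matching or judged free-form answers); the sweep structure stays the same.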

Takeaways, Limitations

Takeaways:
Presents CP-Bench, a new benchmark for assessing contextual paralinguistic reasoning abilities.
Provides a comprehensive analysis of the non-verbal understanding abilities of state-of-the-art Speech-LLMs.
Identifies limitations of existing Speech-LLM evaluations and suggests directions for improvement.
Analyzes the impact of temperature tuning on Speech-LLM performance.
Offers insights for developing more context-aware and emotionally intelligent Speech-LLMs.
Limitations:
Possible limits to the size and diversity of the CP-Bench datasets (neither is specified in detail).
Limited detail on the evaluated models (the types and number of models are not specified).
No analysis of decoding parameters other than temperature.
No verification of performance in real-world applications.