Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention

Created by
  • Haebom

Authors

HyeYoung Lee, Muhammad Nadeem

Outline

In this paper, we propose a 1D-CNN-based speech emotion recognition (SER) framework that captures subtle emotional changes and generalizes across diverse datasets. The model takes Mel-Frequency Cepstral Coefficients (MFCCs) as input features and combines a 1D Convolutional Neural Network (CNN) with data augmentation and with channel and spatial attention mechanisms. Experiments on six datasets show that the proposed method outperforms the existing state of the art, with accuracies of 97.49% on SAVEE, 99.23% on RAVDESS, 89.31% on CREMA-D, 99.82% on TESS, 99.53% on EMO-DB, and 96.39% on EMOVO. These results suggest that combining advanced deep learning techniques can significantly improve generalization across datasets, and that SER has potential for assistive technologies and human-computer interaction in real-world environments.
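To make the idea of channel and spatial attention on a 1D feature map concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the summary above does not specify the attention design, so this assumes a CBAM-like layout (a squeeze-and-excitation style channel gate followed by a gate over time steps). The function names, the weight shapes, and the random weights are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Rescale each channel of x (channels, time) by a gate in (0, 1).

    Assumed design: average over time, then a two-layer bottleneck MLP.
    """
    squeezed = x.mean(axis=1)                # (channels,)
    hidden = np.maximum(0.0, w1 @ squeezed)  # ReLU bottleneck, (reduced,)
    gate = sigmoid(w2 @ hidden)              # (channels,)
    return x * gate[:, None]

def spatial_attention(x, kernel):
    """Rescale each time step of x (channels, time) by a gate in (0, 1).

    Assumed design: pool over channels (mean and max), then filter the two
    pooled rows along time. np.convolve flips the kernel, which does not
    matter for weights that would be learned anyway.
    """
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, time)
    logits = sum(np.convolve(pooled[i], kernel[i], mode="same") for i in range(2))
    gate = sigmoid(logits)                   # (time,)
    return x * gate[None, :]

# Hypothetical feature map: 8 channels over 50 time frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 50))
w1 = rng.standard_normal((2, 8))     # channel bottleneck: 8 -> 2
w2 = rng.standard_normal((8, 2))     # channel bottleneck: 2 -> 8
kernel = rng.standard_normal((2, 7)) # temporal filter of width 7

out = spatial_attention(channel_attention(features, w1, w2), kernel)
```

Both gates only rescale the input, so the output keeps the shape of the feature map and can be passed to the next convolutional layer unchanged. In a trained model the weights would be learned rather than drawn at random.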

Takeaways, Limitations

Takeaways:
We demonstrate the effectiveness of a 1D-CNN-based SER framework that uses data augmentation and attention mechanisms.
We achieve state-of-the-art performance on six diverse datasets.
We show the applicability of SER to assistive technology and human-computer interaction in real-world environments.
Limitations:
Despite high accuracy on the other datasets, the comparatively low accuracy on CREMA-D (89.31%) leaves room for improvement.
Additional analysis and validation of the method's generalization performance are needed.
Further experiments covering different languages and cultural backgrounds are needed.
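As a rough way to quantify the CREMA-D gap noted above, the following sketch computes the unweighted mean of the accuracies reported in the outline and the distance of the lowest dataset from that mean. The figures come from the summary itself; the unweighted mean is an assumption made for illustration and ignores differences in dataset size.

```python
# Accuracies (%) as reported in the outline above.
reported = {
    "SAVEE": 97.49,
    "RAVDESS": 99.23,
    "CREMA-D": 89.31,
    "TESS": 99.82,
    "EMO-DB": 99.53,
    "EMOVO": 96.39,
}

# Unweighted mean across datasets (a rough summary only).
mean_acc = sum(reported.values()) / len(reported)

# Dataset with the lowest reported accuracy and its gap to the mean.
lowest = min(reported, key=reported.get)
gap = mean_acc - reported[lowest]

print(f"mean accuracy: {mean_acc:.2f}%")
print(f"lowest: {lowest} ({reported[lowest]:.2f}%), {gap:.2f} points below the mean")
```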