Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Created by
  • Haebom

Author

Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani

Outline

This paper presents Fleurs-SLU, a multilingual benchmark for spoken language understanding (SLU) that extends coverage to low-resource languages. Fleurs-SLU contains 692 hours of speech for topical utterance classification in 102 languages and 944 hours of speech for multiple-choice question answering via listening comprehension in 92 languages. We extensively evaluate end-to-end speech classification models, cascaded systems that combine speech-to-text transcription with LLM-based classification, and multimodal speech-LLMs on Fleurs-SLU. Experimental results show that cascaded systems are more robust in multilingual SLU, while appropriately pre-trained speech encoders achieve competitive performance on topical speech classification. Closed-source speech-LLMs match or surpass the performance of cascaded systems. Furthermore, we observe a strong correlation between robust multilingual ASR, effective speech-to-text translation, and robust multilingual SLU, demonstrating the mutual benefits of acoustic and semantic speech representations.
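To make the cascaded setup concrete, here is a minimal sketch of the transcribe-then-classify pipeline described above, assuming the Hugging Face transformers library. The model checkpoints and the topic label set are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a cascaded SLU system: multilingual ASR followed by text-based
# topic classification. Checkpoints and labels are illustrative assumptions.
from transformers import pipeline

# Stage 1: speech-to-text transcription (Whisper is one multilingual option).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Stage 2: topic classification of the transcript via zero-shot NLI.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical topic labels for utterance classification.
topics = ["politics", "sports", "science and technology", "travel"]

def classify_utterance(audio_path: str) -> str:
    """Transcribe an audio file, then pick the most likely topic label."""
    transcript = asr(audio_path)["text"]
    result = classifier(transcript, candidate_labels=topics)
    return result["labels"][0]  # labels are sorted by score, highest first

print(classify_utterance("sample_utterance.wav"))  # e.g. "sports"
```

The two stages are independent, which is what makes the cascade robust multilingually: any stronger ASR or text classifier can be swapped in without retraining the other component.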

Takeaways, Limitations

Takeaways:
We present Fleurs-SLU, a new benchmark for multilingual SLU research that covers low-resource languages.
We demonstrate the strengths of cascaded systems in multilingual SLU and the competitiveness of pre-trained speech encoders and closed-source speech-LLMs.
We uncover the interconnections among robust multilingual ASR, effective speech-to-text translation, and multilingual SLU.
Limitations:
Fleurs-SLU focuses on specific languages and tasks, so further research on generalizability is needed.
The performance comparison involving closed-source speech-LLMs lacks detailed analysis.
More comprehensive performance analysis across diverse low-resource languages is needed.