Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

FunAudio-ASR Technical Report

Created by
  • Haebom

Authors

Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou

Outline

This paper presents FunAudio-ASR, a large language model (LLM)-based automatic speech recognition (ASR) system. FunAudio-ASR combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art (SOTA) performance in diverse and complex speech recognition scenarios. It addresses the hallucination problem of existing LLM-based ASR systems and is optimized to meet real-world application requirements, including streaming capability, noise robustness, code-switching, and hotword customization. Experimental results demonstrate its effectiveness and robustness in real-world environments, with SOTA performance on open-source benchmarks and real-world industry evaluation datasets.

Takeaways, Limitations

Takeaways:
Demonstrates the effectiveness of an ASR system that combines large-scale data, large model capacity, LLM integration, and reinforcement learning.
Shows that a practical ASR system suited to real industrial environments is feasible.
Proposes an approach to alleviating the hallucination problem in LLM-based ASR systems.
Strengthens capabilities required for real-world applications, including streaming, noise robustness, and code-switching.
Limitations:
The paper provides few details about the real-world industry evaluation datasets it uses.
It does not analyze the relative contribution of each factor (large-scale data, large model capacity, LLM integration, reinforcement learning) to FunAudio-ASR's performance improvement.
A more comprehensive comparison with other state-of-the-art ASR systems is needed.