Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

Created by
  • Haebom

Author

Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun

Outline

This paper addresses native multimodal large language models (MLLMs), which restructure a single large language model (LLM) into a speech language model (SLM) capable of generating both speech and text. Unlike modular and aligned MLLMs, a native MLLM preserves rich paralinguistic features such as emotion and prosody and generates speech responses directly within the backbone LLM, without a separate speech decoder. However, native MLLMs suffer from degraded performance and catastrophic forgetting because paired speech-text data is scarce. To address this, the paper proposes DeepTalk, an adaptive modality-expert training framework built on a Mixture-of-Experts (MoE) architecture. DeepTalk adaptively partitions experts by modality according to the modality load within the LLM; each expert first undergoes specialized single-modality training and then joint multimodal training. As a result, DeepTalk incurs only a 5.5% performance drop relative to the original LLM, far below the 20%+ drop typical of native MLLMs and on par with modular MLLMs. In addition, end-to-end conversation latency stays within 0.5 seconds, enabling smooth speech interaction. The code and models are available at https://github.com/talkking/DeepTalk .
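To make the modality-specific expert idea concrete, below is a minimal PyTorch sketch of an MoE feed-forward layer in which text tokens and speech tokens are routed to disjoint expert pools plus a set of shared experts. This is an illustrative assumption, not the authors' implementation: the class name `ModalitySpecificMoE`, the expert counts, and the hard modality mask are hypothetical, and DeepTalk's actual adaptive expert partitioning and training schedule are described in the paper and the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySpecificMoE(nn.Module):
    """Toy modality-specific MoE feed-forward layer (illustrative sketch only)."""

    def __init__(self, d_model=512, d_ff=2048, n_text=2, n_speech=2, n_shared=2, top_k=2):
        super().__init__()
        self.n_text, self.n_speech = n_text, n_speech
        self.n_experts = n_text + n_speech + n_shared
        self.top_k = top_k
        self.router = nn.Linear(d_model, self.n_experts)
        # Experts 0..n_text-1 are text-only, the next n_speech are speech-only, the rest are shared.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(self.n_experts)
        ])

    def forward(self, x, modality):
        # x: (batch, seq, d_model); modality: (batch, seq), 0 = text token, 1 = speech token.
        logits = self.router(x)                                    # (B, S, E)
        is_speech = modality.bool().unsqueeze(-1)                  # (B, S, 1)
        neg = torch.finfo(logits.dtype).min
        mask = torch.zeros_like(logits)
        # Speech tokens cannot use text experts, text tokens cannot use speech experts;
        # shared experts remain open to every token.
        mask[..., :self.n_text] = mask[..., :self.n_text].masked_fill(is_speech, neg)
        mask[..., self.n_text:self.n_text + self.n_speech] = \
            mask[..., self.n_text:self.n_text + self.n_speech].masked_fill(~is_speech, neg)
        weights, idx = torch.topk(F.softmax(logits + mask, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.n_experts):
                sel = idx[..., k] == e                             # tokens whose k-th pick is expert e
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * self.experts[e](x[sel])
        return out


if __name__ == "__main__":
    layer = ModalitySpecificMoE()
    x = torch.randn(2, 10, 512)                      # hidden states from the backbone LLM
    modality = torch.randint(0, 2, (2, 10))          # per-token modality tags
    print(layer(x, modality).shape)                  # torch.Size([2, 10, 512])
```

In this sketch the separation of experts is fixed by the mask; in DeepTalk the partitioning is adaptive to the modality load, and each expert pool is first trained on its own modality before joint multimodal training.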

Takeaways and Limitations

Takeaways:
Presents DeepTalk, a framework that effectively addresses the performance degradation of native MLLMs.
Achieves significantly lower performance degradation than existing native MLLMs (5.5% vs. 20% or more).
Matches the performance of modular MLLMs while keeping response latency under 0.5 seconds.
Preserves rich paralinguistic features, retaining the advantages of native MLLMs.
Improves reproducibility and usability by releasing the code and models.
Limitations:
DeepTalk's performance improvements may be limited to specific datasets or models.
The complexity of the MoE architecture may complicate training and deployment.
Generalization across diverse languages and speech environments has not been evaluated.
Larger amounts of paired speech-text data may still be needed.