Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Created by
  • Haebom

Authors

Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu

Outline

MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without text guidance. It combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning ability and knowledge of a pretrained text LLM while adding native speech capabilities. Experiments show that it achieves state-of-the-art results in spoken question answering, delivers speech-to-speech performance comparable to existing text-guided systems, and maintains competitive text performance.
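
The layer-splitting idea can be pictured as a shared, frozen text-LLM trunk with modality-specific layers attached for speech and text. The PyTorch sketch below is a simplified illustration under that assumption only; all class names, dimensions, and the toy backbone are hypothetical and do not reproduce the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ModalitySplitLM(nn.Module):
    """Illustrative sketch of modality-based layer splitting:
    a frozen, pretrained text-LLM backbone is shared across modalities,
    while separate trainable layers handle text vs. speech tokens.
    Hypothetical structure, not MOSS-Speech's actual architecture."""

    def __init__(self, backbone: nn.Module, d_model=1024,
                 text_vocab=32000, speech_vocab=4096):
        super().__init__()
        self.backbone = backbone              # pretrained text-LLM layers
        for p in self.backbone.parameters():  # frozen pre-training strategy:
            p.requires_grad = False           # keep text reasoning/knowledge intact

        # Modality-specific (trainable) embeddings and output heads.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.speech_embed = nn.Embedding(speech_vocab, d_model)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.speech_head = nn.Linear(d_model, speech_vocab)

    def forward(self, tokens: torch.Tensor, modality: str):
        embed = self.text_embed if modality == "text" else self.speech_embed
        head = self.text_head if modality == "text" else self.speech_head
        hidden = self.backbone(embed(tokens))  # shared frozen trunk
        return head(hidden)                    # modality-specific logits

# Usage with a stand-in backbone; in practice this would be a pretrained LLM.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2)
model = ModalitySplitLM(backbone)
logits = model(torch.randint(0, 4096, (1, 16)), modality="speech")
print(logits.shape)  # (1, 16, 4096): next-speech-token distribution
```

Because only the modality-specific layers are trained while the trunk stays frozen, speech generation can reuse the text model's knowledge without an intermediate text step at inference time.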

Takeaways and Limitations

  • Presents a new paradigm for true speech-to-speech LLMs.
  • Demonstrates the potential of expressive and efficient end-to-end speech interaction.
  • Achieves state-of-the-art performance in spoken question answering.
  • Delivers speech-to-speech performance comparable to text-guided systems.
  • Maintains competitive text performance.
  • Removes a potential bottleneck by dropping the intermediate text step.
  • Specific Limitations require further confirmation in the paper.