Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

I-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents

Created by
  • Haebom

Author

Anupam Purwar, Aditya Choudhary

Outline

This paper experiments with and optimizes a low-latency, end-to-end voice-to-voice (V-2-V) communication model for real-time conversational applications. By analyzing the essential components of a V-2-V system, including automatic speech recognition (ASR), text-to-speech (TTS), and dialogue management, the authors identify optimization strategies that reduce processing time while maintaining high-quality interaction. They find that the TTS component, which generates lifelike speech with natural pauses and emotions, has the greatest impact on the Real Time Factor (RTF). The V-2-V architecture, built on CSM1b, uses both the audio and the text of previous conversational turns to capture tone and context and to generate context-sensitive speech. The authors also explored reducing the number of Residual Vector Quantization (RVQ) iterations in the TTS decoder, which lowered latency but degraded speech quality. Overall, the experiments show that the number of RVQ iterations and the number of codebooks used in Mimi are the most impactful optimization levers in a CSM-based V-2-V implementation.
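The Real Time Factor mentioned above is a standard latency metric: processing time divided by the duration of the audio produced, where values below 1.0 mean the component runs faster than real time. A minimal sketch of how one might profile a pipeline component this way (the function and the example numbers are illustrative, not from the paper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Illustrative example: a TTS decoder takes 1.2 s to synthesize 3.0 s of audio.
rtf = real_time_factor(1.2, 3.0)
print(f"RTF = {rtf:.2f}")  # well under 1.0, so the component keeps up with playback
```

Under this metric, the paper's observation is that TTS dominates the pipeline's overall RTF, which is why the RVQ iteration and codebook counts in the TTS decoder become the main tuning knobs.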

Takeaways, Limitations

The TTS component has the greatest impact on RTF.
The CSM1b-based V-2-V architecture can understand conversational context and generate appropriate speech.
Reducing the number of RVQ iterations lowers latency at the expense of voice quality.
The number of RVQ iterations and the number of codebooks are the key optimization factors in a CSM-based V-2-V implementation.
👍