This paper experimentally optimizes a low-latency, end-to-end speech-to-speech communication model for real-time conversational applications. By analyzing the essential components of a speech-to-speech (V-2-V) system, including automatic speech recognition (ASR), text-to-speech (TTS), and dialogue management, we identify optimization strategies that reduce processing time while maintaining high-quality interaction. In particular, we find that the TTS component, which generates lifelike speech with natural pauses and emotion, has the greatest impact on the Real Time Factor (RTF). The V-2-V architecture, built on CSM1b, conditions on both the audio and the text of previous conversational turns to capture tone and context, and generates context-sensitive speech. We further explored reducing the number of Residual Vector Quantization (RVQ) iterations in the TTS decoder and found that aggressive reduction degrades speech quality. Experimental results show that moderately reducing the number of RVQ iterations and the number of codebooks used in Mimi are the most effective optimizations in a CSM-based V-2-V implementation.
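As a point of reference for the metric used throughout, RTF is the ratio of synthesis time to the duration of the audio produced; values below 1.0 indicate faster-than-real-time generation. The following minimal sketch (with made-up timings, not measurements from this work) illustrates the computation:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating audio / duration of the audio produced.

    RTF < 1.0 means the system synthesizes faster than real time,
    which is the requirement for low-latency conversational use.
    """
    return processing_seconds / audio_seconds

# Hypothetical example: a TTS decoder that needs 1.8 s of compute
# to synthesize 3.0 s of speech runs at RTF 0.6 (faster than real time).
rtf = real_time_factor(1.8, 3.0)
print(f"RTF = {rtf:.2f}")
```

Reducing RVQ iterations or Mimi codebooks lowers the per-frame decoding work, which directly lowers the numerator of this ratio.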