This paper presents a zero-shot speech conversion technique that converts the speaker characteristics of input speech to those of a target speaker without additional training, using only a single reference utterance. Previous research has focused on extracting high-quality content representations and removing speaker information with self-supervised learning (SSL) features and K-means quantization. However, this process often discards fine-grained phonetic and prosodic information, degrading intelligibility and prosody retention. We propose a novel method that effectively separates speaker information from phonetic and prosodic information by exploiting quantization residuals while taking their temporal characteristics into account. Using only K-means quantization and a linear projection, the proposed approach achieves simple yet effective disentanglement without complex architectures or explicit supervision, and enables high-quality speech conversion trained solely with a reconstruction loss. Experimental results demonstrate that the proposed model outperforms existing methods on both subjective and objective metrics, improving intelligibility, speaker similarity, and prosody retention.
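For illustration only, the following is a minimal sketch of the core idea as described in the abstract: SSL features are quantized with K-means to obtain content codes, and the quantization residual, which retains fine-grained phonetic and prosodic detail, is passed through a linear projection. All names, dimensions, and the PyTorch framing are assumptions for exposition; the paper's actual model, including how temporal characteristics of the residual are handled, is not reproduced here.

```python
import torch
import torch.nn as nn


class ResidualDisentangler(nn.Module):
    """Toy module: K-means quantization of SSL features plus a linear
    projection of the quantization residual. The quantized codes serve as
    the content representation, while the residual keeps the fine detail
    removed by quantization."""

    def __init__(self, feat_dim=768, num_clusters=512):
        super().__init__()
        # Pretrained K-means centroids would be loaded here; random in this sketch.
        self.register_buffer("centroids", torch.randn(num_clusters, feat_dim))
        self.residual_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, ssl_features):
        # ssl_features: (batch, time, feat_dim) from a self-supervised model
        dists = torch.cdist(ssl_features, self.centroids.unsqueeze(0))  # (B, T, K)
        codes = dists.argmin(dim=-1)                                    # nearest centroid per frame
        quantized = self.centroids[codes]                               # content representation
        residual = ssl_features - quantized                             # fine-grained detail
        return quantized, self.residual_proj(residual)


# Hypothetical usage: both outputs would feed a decoder trained with a reconstruction loss.
if __name__ == "__main__":
    model = ResidualDisentangler()
    content, projected_residual = model(torch.randn(2, 100, 768))
    print(content.shape, projected_residual.shape)  # (2, 100, 768) each
```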