This paper addresses the problem of Target Speech Extraction (TSE), which aims to isolate a specific speaker's speech from a multi-speaker mixture. Existing TSE methods primarily rely on discriminative models, which offer high recognition quality but suffer from artifacts, reduced naturalness, and sensitivity to mismatches between training and testing conditions. Generative models, on the other hand, fall short in recognition quality and intelligibility. In this paper, we propose SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction stages. Instead of relying on speaker embeddings, SoloSpeech draws its conditioning information from the latent space of the cue audio and aligns it with the latent space of the mixture audio, thereby avoiding mismatches between the two. Evaluation on the Libri2Mix dataset shows that SoloSpeech outperforms existing state-of-the-art methods in both intelligibility and quality, and it generalizes well to out-of-domain data and real-world settings.
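To make the cascaded structure concrete, the sketch below outlines how such a compression-extraction-reconstruction-correction pipeline could be wired together, with the cue audio encoded into the same latent space as the mixture before conditioning the extractor. This is a minimal illustration under assumed module names, dimensions, and interfaces (Compressor, Extractor, solo_speech_like_pipeline); it is not the authors' released implementation.

```python
# Illustrative sketch only: all module names, shapes, and layer choices
# below are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Audio compressor mapping waveforms to a compact latent sequence and back."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encode = nn.Conv1d(1, latent_dim, kernel_size=16, stride=8)
        self.decode = nn.ConvTranspose1d(latent_dim, 1, kernel_size=16, stride=8)

class Extractor(nn.Module):
    """Latent-space extractor conditioned on the cue audio's latent sequence."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * latent_dim, latent_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, mix_latent, cue_latent):
        # Pool the cue latent over time and broadcast it along the mixture's
        # time axis, so the condition lives in the same latent space as the mixture
        # (no separate speaker-embedding space is involved).
        cue = cue_latent.mean(dim=-1, keepdim=True).expand_as(mix_latent)
        return self.net(torch.cat([mix_latent, cue], dim=1))

def solo_speech_like_pipeline(mixture, cue, compressor, extractor, corrector):
    """Cascade: compression -> extraction -> reconstruction -> correction."""
    mix_latent = compressor.encode(mixture)             # compression
    cue_latent = compressor.encode(cue)                 # cue in the same latent space
    target_latent = extractor(mix_latent, cue_latent)   # extraction
    rough_estimate = compressor.decode(target_latent)   # reconstruction
    return corrector(rough_estimate)                    # correction

if __name__ == "__main__":
    comp, extr = Compressor(), Extractor()
    corrector = nn.Identity()                 # stand-in for the correction stage
    mixture = torch.randn(1, 1, 16000)        # 1 s of 16 kHz mixture audio
    cue = torch.randn(1, 1, 16000)            # 1 s cue from the target speaker
    target = solo_speech_like_pipeline(mixture, cue, comp, extr, corrector)
    print(target.shape)                       # torch.Size([1, 1, 16000])
```

In this sketch the extractor and corrector are placeholders; the key point it illustrates is that both the mixture and the cue pass through the same compressor, so the conditioning signal and the mixture representation share one latent space.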