This paper addresses a limitation of existing approaches to text-to-speech (TTS) synthesis and voice conversion (VC) that generate high-quality speech from acoustic features such as mel spectrograms. These approaches use a vocoder to convert the acoustic features into waveforms and apply adversarial training in the time domain, but upsampling the waveform incurs significant time and memory overhead. To address this, we propose a Vocoder Projection Feature Discriminator (VPFD) that operates on vocoder features instead of waveforms. Using a pretrained, frozen vocoder feature extractor with only a single upsampling step, our distillation experiments on diffusion-based VC show that VPFD achieves VC performance comparable to a waveform-domain speech discriminator while reducing training time and memory consumption by factors of 9.6 and 11.4, respectively.