In this paper, we present a novel method that applies pre-trained large-scale text-to-image diffusion models to image-to-image conversion in a plug-and-play manner. It achieves high-quality and versatile text-based image-to-image conversion without model training, fine-tuning, or online optimization. To guide text-to-image generation with a reference image, we decompose the guiding elements into different frequency bands of the diffusion features in the DCT spectral space, and design a novel frequency-band permutation layer that enables dynamic control over the influence of the reference image. We show that both the guiding elements and the guiding intensity of the reference image can be flexibly controlled simply by adjusting the type and bandwidth of the selected frequency band. Experimental results demonstrate that the proposed method outperforms existing methods in terms of image quality, diversity, and controllability of image-to-image conversion. The code is publicly available.
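To make the frequency-band mechanism concrete, the following is a minimal, hypothetical sketch of DCT-domain band permutation between two diffusion feature maps. The function name, the simple square low-frequency mask, and the `bandwidth` parameter are illustrative assumptions, not the paper's actual layer; it only shows the general idea of transplanting a chosen DCT band of the reference features into the generated features.

```python
# Hypothetical sketch: permute one DCT frequency band of the generated
# diffusion features with the corresponding band of the reference features.
import numpy as np
from scipy.fft import dctn, idctn

def band_permute(gen_feat, ref_feat, bandwidth=0.2, band="low"):
    """Swap a DCT frequency band from the reference into the generated features.

    gen_feat, ref_feat : (C, H, W) arrays of diffusion features
    bandwidth          : fraction in (0, 1]; larger -> stronger reference guidance
    band               : "low" (coarse layout) or "high" (fine detail), assumed split
    """
    C, H, W = gen_feat.shape
    # Per-channel 2D DCT over the spatial axes
    gen_dct = dctn(gen_feat, type=2, norm="ortho", axes=(1, 2))
    ref_dct = dctn(ref_feat, type=2, norm="ortho", axes=(1, 2))

    # Binary mask selecting the chosen frequency band in the DCT spectrum
    h = max(1, int(H * bandwidth))
    w = max(1, int(W * bandwidth))
    mask = np.zeros((H, W), dtype=bool)
    mask[:h, :w] = True              # low-frequency corner of the spectrum
    if band == "high":
        mask = ~mask

    # Replace the selected band of the generated features with the reference band
    mixed = np.where(mask[None], ref_dct, gen_dct)
    return idctn(mixed, type=2, norm="ortho", axes=(1, 2))
```

Under these assumptions, choosing the low-frequency band would transfer coarse layout and appearance from the reference, the high-frequency band would transfer fine detail, and enlarging `bandwidth` would strengthen the guidance, mirroring the controllability described above.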