Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation

Created by
  • Haebom

Authors

Xiang Gao, Jiaying Liu

Outline

This paper presents a plug-and-play method for adapting pre-trained large-scale text-to-image diffusion models to image-to-image translation. It achieves high-quality, versatile text-driven image-to-image translation without model training, fine-tuning, or online optimization. To guide text-to-image generation with a reference image, the authors decompose the guiding factors into different frequency bands of diffusion features in the DCT spectral space and design a frequency band substitution layer that dynamically controls how the reference image influences the result. The guiding factor and the guiding intensity of the reference image can be adjusted flexibly by changing the type and the bandwidth of the substituted frequency band, respectively. Experimental results show that the proposed method outperforms existing methods in image quality, diversity, and controllability of image-to-image translation. The code is publicly available.
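
The frequency-band idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`dct_matrix`, `band_mask`, `substitute_band`), the `(channels, height, width)` feature shape, and the choice to define a band by the index sum `u + v` are all assumptions made for illustration. The sketch transforms two feature maps to the DCT domain, copies one band of coefficients from the reference into the generated features, and transforms back.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C of size n x n, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)    # rescale the DC row for orthonormality
    return c


def band_mask(h: int, w: int, lo: int, hi: int) -> np.ndarray:
    """Boolean (h, w) mask selecting DCT coefficients with lo <= u + v < hi.

    Assumption: a band is defined by the sum of the vertical and horizontal
    frequency indices. lo=0 gives a low-pass band, hi=h+w a high-pass band,
    and anything in between a mid-pass band.
    """
    s = np.arange(h)[:, None] + np.arange(w)[None, :]
    return (s >= lo) & (s < hi)


def substitute_band(gen: np.ndarray, ref: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Replace one DCT band of `gen` with the same band of `ref`.

    gen, ref: arrays of shape (channels, height, width), standing in for
    hypothetical diffusion features of the generated and reference images.
    """
    _, h, w = gen.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    # 2D DCT per channel: C_h @ X @ C_w^T (matmul broadcasts over channels).
    gen_f = ch @ gen @ cw.T
    ref_f = ch @ ref @ cw.T
    # Take reference coefficients inside the band, generated ones outside it.
    mixed = np.where(band_mask(h, w, lo, hi), ref_f, gen_f)
    # Inverse 2D DCT: C_h^T @ Y @ C_w.
    return ch.T @ mixed @ cw
```

Under these assumptions, `substitute_band(gen, ref, 0, 10)` would carry over only the low-frequency content of the reference, and widening the band (a larger `hi`) would copy more coefficients and so strengthen the reference's influence. In the method summarized here, the substitution is applied to diffusion features rather than to standalone arrays as in this sketch.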

Takeaways, Limitations

Takeaways:
Enables efficient, high-quality text-driven image-to-image translation by leveraging pre-trained large-scale text-to-image models.
Works in a plug-and-play manner, with no model training, fine-tuning, or online optimization required.
Allows flexible control over the guiding factor and guiding intensity of the reference image by adjusting the type and bandwidth of the frequency band.
Outperforms existing methods in image quality, diversity, and controllability.
Supports reproducibility and extension through publicly available code.
Limitations:
Performance may depend on the quality of the underlying pre-trained text-to-image model.
Performance may degrade for certain types of images or text prompts.
Frequency decomposition in the DCT spectral space may have inherent limitations.
Further evaluation is needed to establish how well the method generalizes across image translation tasks.