
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

Created by
  • Haebom

Authors

Samuel Lavoie, Michael Noukhovitch, Aaron Courville

Outline

This paper argues that the success of diffusion models is largely due to input conditioning. Accordingly, it investigates the representations used to condition diffusion models, holding that an ideal representation should improve sample fidelity, be easy to generate, and be composable so as to allow sampling beyond the training distribution. The authors introduce Discrete Latent Codes (DLCs), image representations derived from Simplicial Embeddings trained with a self-supervised learning objective. Unlike standard continuous image embeddings, DLCs are sequences of discrete tokens: they are easy to generate, and their composability allows sampling of novel images outside the training distribution. Diffusion models trained with DLCs achieve improved generation fidelity, establishing a new state of the art for unconditional image generation on ImageNet. The paper further shows that composing DLCs lets image generators produce out-of-distribution samples that coherently combine the semantics of different images in diverse ways. Finally, it demonstrates how DLCs enable text-to-image generation by leveraging large pre-trained language models: a text diffusion language model is efficiently fine-tuned to generate DLCs, yielding novel samples outside the image generator's training distribution.
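As a rough illustration of the core idea, the sketch below shows how a discrete code could be read off a Simplicial-Embedding-style representation: the embedding is split into groups, each group is normalized with a softmax, and the argmax of each group becomes one token. All shapes and names here are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative sketch (not the authors' code): deriving a discrete latent
# code (DLC) from a continuous embedding in the spirit of Simplicial
# Embeddings. The embedding is viewed as NUM_TOKENS groups ("simplices")
# of VOCAB_SIZE dimensions; a softmax within each group gives a
# distribution, and the argmax of each group yields one discrete token.

NUM_TOKENS = 32   # tokens per code -- assumed for illustration
VOCAB_SIZE = 64   # entries per simplex (token vocabulary) -- assumed

def embedding_to_dlc(embedding: torch.Tensor) -> torch.Tensor:
    """Map a (batch, NUM_TOKENS * VOCAB_SIZE) embedding to (batch, NUM_TOKENS) tokens."""
    logits = embedding.view(-1, NUM_TOKENS, VOCAB_SIZE)  # split into simplices
    probs = torch.softmax(logits, dim=-1)                # per-simplex distribution
    return probs.argmax(dim=-1)                          # one token per simplex

emb = torch.randn(4, NUM_TOKENS * VOCAB_SIZE)  # stand-in for an encoder output
dlc = embedding_to_dlc(emb)                    # shape (4, 32), values in [0, 64)
```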

Takeaways, Limitations

Takeaways:
Using discrete latent codes (DLCs) improves the generation fidelity of diffusion models and achieves a new state of the art for unconditional generation on ImageNet.
The composability of DLCs enables the generation of novel images beyond the training distribution that combine the semantics of different images (see the sketch after this list).
Combining large pre-trained language models with DLCs enables efficient text-to-image generation (a sketch of this pipeline closes the post).
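A minimal sketch of the composition idea, under the assumption that mixing token positions from two codes is one way to compose them (the paper's exact composition operator may differ); `diffusion_model.generate` is a hypothetical conditioned sampler:

```python
import torch

# Hypothetical sketch: because a DLC is just a sequence of discrete tokens,
# two codes can be mixed position-wise to describe an image that combines
# the semantics of both sources. This mixing rule is an assumption made for
# illustration, not necessarily the paper's composition scheme.

def compose_dlcs(dlc_a: torch.Tensor, dlc_b: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Keep each token of dlc_a with probability p, otherwise take dlc_b's."""
    mask = torch.rand(dlc_a.shape, device=dlc_a.device) < p
    return torch.where(mask, dlc_a, dlc_b)

# mixed = compose_dlcs(dlc_dog, dlc_beach)    # hypothetical source codes
# image = diffusion_model.generate(mixed)     # hypothetical conditioned sampler
```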
Limitations:
DLC performance is demonstrated mainly on a single dataset (ImageNet); evaluation of generalization to other datasets is needed.
Further study is needed of the computational cost and efficiency of the DLC generation process.
Further analysis is needed of the limits and constraints on the composability of DLCs.
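Finally, a high-level sketch of the text-to-image path described in the outline, with both models as placeholders (the method names are assumptions; no real API is implied):

```python
# Hypothetical end-to-end sketch: a fine-tuned text diffusion language
# model maps a prompt to a DLC token sequence, which then conditions the
# image diffusion model. Both objects and their methods are placeholders.

def text_to_image(prompt, text_lm, image_diffusion):
    dlc_tokens = text_lm.sample_dlc(prompt)      # assumed method: text -> DLC tokens
    return image_diffusion.generate(dlc_tokens)  # assumed method: DLC -> image
```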