# How Does DALLE Generate Images?

![https://openai.com/dall-e-3](https://upload.cafenono.com/image/slashpagePost/20240417/173146_FaxpjBVPwwOGvoaTNF?q=75&s=1280x180&t=outside&f=webp)

In this post, we take a look at DALLE, a model that generates images from text (text-to-image).

## Image Generation

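Image generation has advanced rapidly since 2014, building on a model called the GAN. Today it is sophisticated enough that telling a human-made picture from an AI-generated one is close to impossible. In 2022, a work created with the image generation AI Midjourney even won first place in an art competition.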

![https://www.researchgate.net/figure/Progress-of-image-generation-made-by-different-GAN-models-over-the-years_fig1_353838206](https://upload.cafenono.com/image/slashpagePost/20240417/173210_xZNJamgGhTcOF0dZGd?q=75&s=1280x180&t=outside&f=webp)

![https://www.seoul.co.kr/news/international/2022/09/04/20220904500075](https://upload.cafenono.com/image/slashpagePost/20240417/173222_CPtDl5XGnLVhJtri0O?q=75&s=1280x180&t=outside&f=webp)

Before looking at DALLE, let's first look at GenAI.

# Gen AI

## Representation

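Deep learning models are typically trained with supervised learning on labeled data. When solving a classification problem such as telling dogs from cats, the model approximates a function that separates the classes. A well-trained model that neither overfits nor underfits also predicts correctly on new data. In that case, we say the model has learned a good representation of the data. AlexNet, for example, classified ImageNet images well, so it can be said to have a good representation of that dataset.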

![https://spotintelligence.com/2023/12/11/representation-learning/](https://upload.cafenono.com/image/slashpagePost/20240417/173253_jsWD40FOMvs883yCd3?q=75&s=1280x180&t=outside&f=webp)

## Generation

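However, being good at representation does not mean being good at generation. It is like a person who reads and listens to English well but cannot write or speak it. For this reason, designing models that generate well exists as a separate field, GenAI. A GenAI model can generate with the help of a representation model, or it can learn to generate on its own.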

![Image](https://upload.cafenono.com/image/slashpagePost/20240417/173320_mJROygIxPkAj9yUxoW?q=75&s=1280x180&t=outside&f=webp)

GenAI modeling approaches fall broadly into four categories.

1. **auto-regressive**

2. **VAE**

3. **GAN**

4. **Diffusion**

# Gen AI Modeling Approaches

## **1. auto-regressive**

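**The basic idea of auto-regressive modeling is that the best predictor of what happens at time t is what happened up to time t-1.** Representative models that take this approach are Causal Language Models such as GPT.

Language can be understood as a sequential arrangement of tokens. Training proceeds by feeding the tokens used so far (up to time t-1) as the condition and predicting the token at time t. By learning the conditional probability of language in this way, Causal Language Models acquire general language generation ability.

In natural language processing, the auto-regressive approach is mainstream, and it is usually implemented by stacking Transformer decoder blocks.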
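As a rough illustration of the idea above, here is a minimal sketch of auto-regressive generation. It is not how GPT is implemented: the hand-written `BIGRAM` table is a hypothetical stand-in for a trained model, and it conditions only on the single previous token, whereas a real Causal Language Model conditions on all earlier tokens.

```python
import random

# Hypothetical toy conditional distribution P(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
    "sleeps": {"</s>": 1.0},
}


def sample_next(prev, rng):
    """Draw the token at step t conditioned on the token at step t-1."""
    tokens = list(BIGRAM[prev].keys())
    weights = list(BIGRAM[prev].values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(rng, max_len=10):
    """Generate tokens one at a time until the end marker appears."""
    sequence = ["<s>"]
    while sequence[-1] != "</s>" and len(sequence) < max_len:
        sequence.append(sample_next(sequence[-1], rng))
    return sequence


def sequence_probability(sequence):
    """Chain rule: the sequence probability is the product of per-step conditionals."""
    prob = 1.0
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= BIGRAM[prev].get(cur, 0.0)
    return prob


sentence = generate(random.Random(0))
print(" ".join(sentence), sequence_probability(sentence))
```

The point of the sketch is the loop in `generate`: each new token is sampled from a distribution conditioned on what came before, then appended and used as the condition for the next step.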

![https://velog.io/@nawnoes/Transformer-기반의-자연어처리-모델](https://upload.cafenono.com/image/slashpagePost/20240417/173335_PIN7BN2Ac5FpIJAlzw?q=75&s=1280x180&t=outside&f=webp)

## **2. VAE**

**[[1312.6114] Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114)**

VAE (Variational Auto-Encoder) is not used much on its own today, but it is a modeling method that can be seen as one of the origins of generative AI.

![https://huidea.tistory.com/296](https://upload.cafenono.com/image/slashpagePost/20240417/173350_Nl4Xf5mf4HMFqGQ0K6?q=75&s=1280x180&t=outside&f=webp)

### Auto-Encoder

Before VAE, we need to look at the Auto-Encoder. The main purpose of an Auto-Encoder is to learn an **efficient representation** of the input data. It consists of an encoder that converts the input into a latent vector and a decoder that restores it to the original. **A latent vector is a vector of feature variables that can represent the input, such as skin color, height, or gender.** Through compressing and restoring, the Auto-Encoder obtains an efficient representation of the input.
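The compress-and-restore loop can be sketched in a few lines. This is a deliberately tiny, hypothetical example: the data are 2-D points on a line, the encoder and decoder are linear (two weights each) rather than neural networks, and the initial values and learning rate are arbitrary choices.

```python
# Toy data: 2-D points that all lie on the line x2 = 2 * x1,
# so a single latent number is enough to describe each point.
DATA = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(point, w):
    """Encoder: compress a 2-D input into a 1-D latent value z."""
    return w[0] * point[0] + w[1] * point[1]


def decode(z, v):
    """Decoder: reconstruct the 2-D input from the latent value z."""
    return (v[0] * z, v[1] * z)


def reconstruction_loss(w, v):
    """Mean squared reconstruction error over the toy data set."""
    total = 0.0
    for point in DATA:
        recon = decode(encode(point, w), v)
        total += (recon[0] - point[0]) ** 2 + (recon[1] - point[1]) ** 2
    return total / len(DATA)


def train(steps=500, lr=0.05):
    """Full-batch gradient descent on encoder weights w and decoder weights v."""
    w, v = [0.5, 0.1], [0.1, 0.5]  # arbitrary fixed initial values
    for _ in range(steps):
        grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
        for point in DATA:
            z = encode(point, w)
            recon = decode(z, v)
            err = (recon[0] - point[0], recon[1] - point[1])
            dz = 2.0 * (err[0] * v[0] + err[1] * v[1])  # d(loss)/dz
            for i in range(2):
                grad_v[i] += 2.0 * err[i] * z / len(DATA)
                grad_w[i] += dz * point[i] / len(DATA)
        w = [w[i] - lr * grad_w[i] for i in range(2)]
        v = [v[i] - lr * grad_v[i] for i in range(2)]
    return w, v


w, v = train()
print("reconstruction loss after training:", reconstruction_loss(w, v))
```

The only training signal is how well the input is reconstructed, so the 1-D latent value is forced to carry the information needed to restore the 2-D point.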

![https://www.v7labs.com/blog/autoencoders-guide](https://upload.cafenono.com/image/slashpagePost/20240417/173407_cbOa1NWBI4L5BwQM8X?q=75&s=1280x180&t=outside&f=webp)

### VAE

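VAE follows the basic scheme of the Auto-Encoder, but there are a few differences.

**First, when encoding an input, VAE maps each feature of the latent space to a Gaussian (normal) distribution.** In an Auto-Encoder, each dimension of the latent vector is a feature representing the image and takes a single scalar value. **VAE instead maps each feature to a mean and a variance**, and training regularizes these distributions toward the standard normal distribution.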

![https://velog.io/@yunyoseob/Gaussian-Distribution-정규분포](https://upload.cafenono.com/image/slashpagePost/20240417/173423_Klp0dtd3aHe7N9ajrp?q=75&s=1280x180&t=outside&f=webp)

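A vector obtained by random sampling from the distribution of each latent dimension is then passed to the decoder, which restores the original image.

Because VAE learns from the probability distribution of the latent vector and from random samples drawn from it, **unlike the Auto-Encoder it can learn the ability to generate.**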
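The sampling step can be sketched as follows. This is a minimal illustration under assumptions: `mu` and `log_var` stand in for what a trained encoder would output per latent dimension, and the function names are made up for this example. The first function draws the random sample (written as mean plus scaled noise, the usual reparameterization), and the second measures how far the encoded distribution is from the standard normal, which is the regularization term in the VAE loss.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, log_var))


# Hypothetical encoder output for one image with a 3-dimensional latent space.
mu = [0.2, -1.0, 0.5]
log_var = [0.0, -0.5, 0.3]

z = reparameterize(mu, log_var, random.Random(0))
print("sampled latent vector:", z)
print("KL regularization term:", kl_to_standard_normal(mu, log_var))
```

At generation time, no encoder is needed: a latent vector can be drawn directly from the standard normal distribution and passed to the decoder.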

## **3. GAN**

**[[1406.2661] Generative Adversarial Networks](https://arxiv.org/abs/1406.2661)**

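GAN (Generative Adversarial Networks) has long been the mainstream in image generation. As the name suggests, it generates through adversarial competition between networks. A GAN has two neural networks, the Generator and the Discriminator, which can be compared to a counterfeiter and a police officer.

GAN training is a two-player game over a single objective: the Generator (counterfeiter) must produce fake images that look real enough to fool the Discriminator (police officer), while the Discriminator must correctly tell generated images from real ones.

As training progresses, the Generator of a well-trained GAN can produce images of such high quality that a human cannot tell they are fake.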
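The two competing goals can be written as two loss functions. The sketch below assumes the Discriminator outputs a probability that an image is real, and the numbers passed in are made-up Discriminator outputs rather than results from a trained network. The Generator loss uses the non-saturating form (-log D(G(z))), a common practical variant of the minimax objective.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the Discriminator.

    d_real: probabilities of "real" that D assigns to real images.
    d_fake: probabilities of "real" that D assigns to generated images.
    """
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating Generator loss: low when D rates the fakes as real."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical Discriminator outputs early in training: fakes are easy to spot.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))
# Hypothetical outputs later on: D can no longer tell the two apart.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

When the Discriminator does its job well its loss is low and the Generator's is high, and the reverse holds when the Generator fools it, which is what makes the training adversarial.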

![https://baechu-story.tistory.com/12](https://upload.cafenono.com/image/slashpagePost/20240417/173807_5BSJdIdqsmb9p3jDqE?q=75&s=1280x180&t=outside&f=webp)

![https://velog.io/@hyebbly/Deep-Learning-Loss-정리-1-GAN-loss](https://upload.cafenono.com/image/slashpagePost/20240417/173821_aoKtngvIprDZOnwTuj?q=75&s=1280x180&t=outside&f=webp)

## 4. Diffusion

**[[2006.11239] Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)**

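The Diffusion approach exploits the fact that a model able to remove noise is also able to generate. The forward process adds noise to the original image at every time step, so that by the final step T the original image has turned into pure noise. The noise added to each pixel is sampled randomly from a Gaussian distribution.

**What the Diffusion model learns is the reverse process, which removes the noise.** Looking at the loss function, it is designed to **minimize the error between the noise actually added at step t and the noise the model predicts**. Since 2021, Diffusion-based models have been the mainstream in Image Generation.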
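A minimal sketch of the forward process and the training loss, under stated assumptions: the image is a short list of pixel values, the noise schedule is the linear one from the DDPM paper (1000 steps, beta from 1e-4 to 0.02), and the "predicted noise" is a placeholder because no network is trained here. The function names are made up for this example.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t); they shrink toward 0 as t grows."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process in one shot: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def noise_prediction_loss(true_noise, predicted_noise):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(true_noise, predicted_noise)) / len(true_noise)


alpha_bars = make_alpha_bars()
x0 = [0.8, -0.3, 0.1, 0.5]  # hypothetical 4-pixel "image"
xt, eps = add_noise(x0, 500, alpha_bars, random.Random(0))

# A real model would predict the noise from (xt, t); zeros are a placeholder guess.
placeholder_prediction = [0.0 for _ in eps]
print("noisy image:", xt)
print("loss for the placeholder prediction:", noise_prediction_loss(eps, placeholder_prediction))
```

At the last step the cumulative factor is close to zero, so almost nothing of the original image is left, which matches the description of the image ending up as pure noise.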

![https://xoft.tistory.com/32](https://upload.cafenono.com/image/slashpagePost/20240417/173840_R7wYZu1GiQ9p8N0pd7?q=75&s=1280x180&t=outside&f=webp)

![https://xoft.tistory.com/32](https://upload.cafenono.com/image/slashpagePost/20240417/173851_Z7GkS9gbTbfLMpLpsu?q=75&s=1280x180&t=outside&f=webp)

![Image](https://upload.cafenono.com/image/slashpagePost/20240417/173905_b3onWNV4YxeCUXleDj?q=75&s=1280x180&t=outside&f=webp)

# Conditional Generation

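So far we have briefly looked at the ways of modeling Gen AI. However, what matters is that outputs are not produced at random but **match the user's intent**. There have been many attempts to make models generate under specific conditions, and the most accessible one is **generating images conditioned on a text prompt**.

DALLE is a representative text-to-image model.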

![https://mvje.tistory.com/134](https://upload.cafenono.com/image/slashpagePost/20240417/173924_FseEcCg83Kd8jF5wRF?q=75&s=1280x180&t=outside&f=webp)

DALLE has been released in three versions.

- _[[DALLE-1] Zero-Shot Text-to-Image Generation, 2021](https://arxiv.org/abs/2102.12092)_

    - Training approach: VAE, Autoregressive

- _[[DALLE-2] Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022](https://arxiv.org/abs/2204.06125)_

    - Training approach: Diffusion

- _[[DALLE-3] Improving Image Generation with Better Captions, 2023](https://cdn.openai.com/papers/dall-e-3.pdf)_

    - Training approach: Diffusion-based, as in DALLE-2; the documented improvement is the quality of the dataset captions

Let's go through them in order.

# DALLE-1 (2021)

_[[DALLE-1] Zero-Shot Text-to-Image Generation, 2021](https://arxiv.org/abs/2102.12092)_

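DALLE-1 used a VAE together with transformer-based auto-regressive modeling. It was trained on about 250 million image-text pairs and, as the paper title suggests, showed strong zero-shot performance, generating images for captions from datasets it was never trained on.

## Stage 1 - dVAE

In stage 1, the 256×256 input image is compressed with a dVAE (**discrete VAE**) into a 32×32 grid of image tokens, each chosen from a codebook of 8192 entries. Unlike a standard VAE, which maps to a continuous Gaussian distribution, the dVAE maps each grid position to a discrete codebook index. Because picking a discrete index is not differentiable, the dVAE uses a relaxation trick for the random sampling (Gumbel-softmax in the paper; details omitted here), and the decoder learns by reconstructing the image from the result.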
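The sampling trick mentioned above is, in the DALLE-1 paper, the Gumbel-softmax relaxation. The sketch below shows it for a single grid position with a toy codebook of 4 entries instead of 8192. The logits are made-up encoder outputs and the function names are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed (soft) sample from a categorical distribution over codebook entries."""
    # Gumbel(0, 1) noise; the uniform draw is clamped away from 0 and 1 for safety.
    gumbels = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    scores = [(logit + g) / temperature for logit, g in zip(logits, gumbels)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Hard codebook index: the entry with the largest relaxed weight."""
    weights = gumbel_softmax(logits, temperature, rng)
    return max(range(len(weights)), key=lambda i: weights[i])


rng = random.Random(0)
logits = [1.5, 0.2, -0.7, 0.9]  # hypothetical encoder output for one grid position
print("soft weights:", gumbel_softmax(logits, 1.0, rng))
print("sampled image token:", sample_token(logits, 1.0, rng))
```

The soft weights are differentiable with respect to the logits, so gradients can flow back to the encoder during training, while lowering the temperature pushes the weights toward a one-hot choice of a single codebook entry.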

![Image](https://upload.cafenono.com/image/slashpagePost/20240417/174012_QNT4LbZtrJeJ0geB39?q=75&s=1280x180&t=outside&f=webp)

## Stage 2 - Autoregressive (transformer)

First, the 32×32 grid of image tokens from the dVAE encoder is flattened into a sequence of 32×32 = 1024 tokens.

The text prompt is then converted into a vector of tokens (up to 256).

**The two are concatenated and fed into a transformer decoder.**

The text tokens are given as-is, while the image tokens are causally masked and predicted one after another, auto-regressively.

Through this process, the transformer decoder learns to generate image tokens given a text prompt.
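The steps above can be sketched as plain list operations. This follows the description in this post rather than the paper's exact implementation: `pad_id` and the helper names are hypothetical, and padding with a single id is a simplification.

```python
TEXT_LEN = 256        # maximum number of text tokens
IMAGE_LEN = 32 * 32   # 1024 image tokens from the dVAE encoder


def build_sequence(text_tokens, image_tokens, pad_id=0):
    """Pad or truncate the text to TEXT_LEN, then append the image tokens."""
    padded_text = (text_tokens + [pad_id] * TEXT_LEN)[:TEXT_LEN]
    return padded_text + image_tokens


def causal_mask(length):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


def training_pairs(sequence):
    """Yield (context, target) for every image position: predict token t from tokens < t."""
    for t in range(TEXT_LEN, len(sequence)):
        yield sequence[:t], sequence[t]


text_tokens = [17, 42, 7]               # hypothetical tokenized prompt
image_tokens = list(range(IMAGE_LEN))   # hypothetical dVAE token ids
sequence = build_sequence(text_tokens, image_tokens)
print(len(sequence))                    # 256 + 1024 positions

context, target = next(training_pairs(sequence))
print(len(context), target)             # the first image token sees only the text
```

The causal mask is what keeps each image token from looking at later image tokens, so the same model can be used at inference time when those tokens do not exist yet.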

![https://housekdk.gitbook.io/ml/ml/computer-vision-transformer-based/zero-shot-text-to-image-generation-dall-e](https://upload.cafenono.com/image/slashpagePost/20240417/174024_asc1tP7fY5ssJwIC22?q=75&s=1280x180&t=outside&f=webp)

**At inference time, when a user enters a text prompt, the transformer decoder predicts the image tokens and the dVAE decoder reconstructs them into an image.**

# DALLE-2 (2022)

_[[DALLE-2] Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022](https://arxiv.org/abs/2204.06125)_

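DALLE-2 raised the output resolution fourfold over its predecessor (from 256×256 to 1024×1024) and became far more refined. Like the previous version, it works in 2 stages.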

![[DALLE-2] Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022](https://upload.cafenono.com/image/slashpagePost/20240417/174043_f6KHG0kt3ZPwq21Fxn?q=75&s=1280x180&t=outside&f=webp)

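## Stage 1 - prior: obtaining an image embedding for the text via CLIP

To understand DALLE-2, we first need to know a representation model called **CLIP** (**C**ontrastive **L**anguage-**I**mage **P**re-training).

It is trained as follows. A large dataset of text-image pairs is prepared (about 400 million pairs in the CLIP paper). The text is passed through a transformer encoder and the image through an image encoder such as **ViT** (**Vi**sion **T**ransformer) to obtain embeddings.

Training pushes the cosine similarity of matching pairs up and that of all non-matching pairs down. **With a trained CLIP, matching images and texts can be retrieved in both directions.**

In DALLE-2, the stage 1 prior takes the CLIP text embedding of a caption and produces the corresponding CLIP image embedding.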
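The contrastive objective described above can be sketched as a symmetric cross-entropy over the cosine-similarity matrix. This is a toy version under assumptions: the embeddings are hand-written 2-D vectors instead of encoder outputs, the temperature is fixed at 0.07 rather than learned, and the function names are made up for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every text, divided by the temperature."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses over one batch of pairs."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Hypothetical batch of two pairs: pair i consists of images[i] and texts[i].
images = [[1.0, 0.1], [0.2, 1.0]]
texts = [[0.9, 0.0], [0.1, 1.1]]
print("loss when pairs match:", clip_loss(images, texts))
print("loss when texts are swapped:", clip_loss(images, texts[::-1]))
```

The loss is small only when each image is most similar to its own text and each text to its own image, which is why the trained embeddings can be matched in both directions.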

![[2103.00020] Learning Transferable Visual Models From Natural Language Supervision](https://upload.cafenono.com/image/slashpagePost/20240417/174108_gTjrxBwdOsns5MLoGH?q=75&s=1280x180&t=outside&f=webp)

[[2103.00020] Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)

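## Stage 2 - decoder

Stage 2 uses a diffusion model. First, the text prompt is passed through the CLIP text encoder, and the stage 1 prior converts that text embedding into a CLIP image embedding. This image embedding is then combined with the tokenized text prompt to form the condition.

This condition is given to the diffusion model, and t steps of noise are added to the training image. The Diffusion model is trained with the loss function below so that it can remove the noise and restore the original image.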

![Image](https://upload.cafenono.com/image/slashpagePost/20240417/174145_cgr30XZN8HGAcPvf6y?q=75&s=1280x180&t=outside&f=webp)

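The image produced by this Diffusion model is only 64×64, so it is passed through two further Diffusion upsampler models (64×64 to 256×256, then 256×256 to 1024×1024) to obtain the final high-resolution image.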

![https://ffighting.net/deep-learning-paper-review/diffusion-model/dalle2/](https://upload.cafenono.com/image/slashpagePost/20240417/174155_70iIQB38KgUFZQGQ9t?q=75&s=1280x180&t=outside&f=webp)

# DALLE-3 (2023)

_[[DALLE-3] Improving Image Generation with Better Captions, 2023](https://cdn.openai.com/papers/dall-e-3.pdf)_

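DALLE-3 remains a diffusion-based text-to-image model, and the improvement documented in the paper lies in the training captions: a dedicated image captioner was trained to produce consistent, detailed synthetic captions with nothing left out. DALLE was then retrained on a blend of 95% synthetic captions and 5% original captions, which raised its performance.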
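The 95% / 5% blend can be pictured as a per-example coin flip when a training caption is chosen. The sketch below is only an illustration of that ratio: `pick_caption` and the example captions are hypothetical, not taken from the paper's training code.

```python
import random


def pick_caption(synthetic_caption, original_caption, rng, synthetic_ratio=0.95):
    """Use the synthetic caption with probability synthetic_ratio, else the original."""
    return synthetic_caption if rng.random() < synthetic_ratio else original_caption


rng = random.Random(0)
synthetic = "A fluffy orange cat sleeping on a blue sofa next to a window."
original = "my cat"

picks = [pick_caption(synthetic, original, rng) for _ in range(10000)]
print("share of synthetic captions:", picks.count(synthetic) / len(picks))
```

Keeping a small share of original captions means the model still sees the short, informal text that real users tend to write.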

![[2023.10] Improving Image Generation with Better Captions](https://upload.cafenono.com/image/slashpagePost/20240417/174236_D1AYLcWzdblz1SU9tS?q=75&s=1280x180&t=outside&f=webp)

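# Recent Trends

That concludes our look at DALLE, a text-to-image model. Other representative text-to-image models include Midjourney and Stable Diffusion. Stable Diffusion is Diffusion-based, and Midjourney is generally understood to be as well.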

![https://www.marktechpost.com/2022/11/14/how-do-dall%C2%B7e-2-stable-diffusion-and-midjourney-work/](https://upload.cafenono.com/image/slashpagePost/20240417/174254_6LgU0CvDWhaHmgylQf?q=75&s=1280x180&t=outside&f=webp)

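What are DALLE's limitations? One is that it struggles to generate images that must follow a very detailed specification. Changing the prompt even slightly changes the overall composition, so keeping the results consistent is difficult. For this reason there are attempts to pin the output to a specific pose or sketch image, and ControlNet is the representative model.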

### ControlNet

**[Adding Conditional Control to Text-to-Image Diffusion Models, 2023](https://arxiv.org/abs/2302.05543)**

When given a specific pose or a sketch image together with a text prompt, ControlNet can generate an image that preserves that pose or outline.

![https://www.internetmap.kr/entry/Stable-Diffusion-ControlNet1](https://upload.cafenono.com/image/slashpagePost/20240417/174332_DcD6tZKVvGxjUYICnq?q=75&s=1280x180&t=outside&f=webp)

### Sora

**[Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models, 2024](https://arxiv.org/abs/2402.17177)**

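Many models are also appearing in the 3D and Video domains. In February 2024, OpenAI announced Sora, a text-to-video model. Sora is a generative AI model that combines a transformer architecture with diffusion (a diffusion transformer), and it is expected to have an impact across many fields.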

![https://www.hani.co.kr/arti/economy/it/1128910.html](https://upload.cafenono.com/image/slashpagePost/20240417/174405_7abW7I9TGMaigv7bjX)

That's all for this post.

# Reference

- [https://arxiv.org/abs/2204.06125](https://arxiv.org/abs/2204.06125)

- [https://ffighting.net/deep-learning-paper-review/diffusion-model/dalle2/](https://ffighting.net/deep-learning-paper-review/diffusion-model/dalle2/)

- [https://namu.wiki/w/DALL·E](https://namu.wiki/w/DALL%C2%B7E)

- [https://namu.wiki/w/CLIP 모델?from=clip 모델](https://namu.wiki/w/CLIP%20%EB%AA%A8%EB%8D%B8?from=clip%20%EB%AA%A8%EB%8D%B8)

- [https://www.youtube.com/watch?v=vZdEGcLU_8U&t=112s&ab_channel=모두의연구소](https://www.youtube.com/watch?v=vZdEGcLU_8U&t=112s&ab_channel=%EB%AA%A8%EB%91%90%EC%9D%98%EC%97%B0%EA%B5%AC%EC%86%8C)

- [https://medium.com/humanscape-tech/paper-review-vae-ac918509a9ba](https://medium.com/humanscape-tech/paper-review-vae-ac918509a9ba)

- [https://eehoeskrap.tistory.com/727](https://eehoeskrap.tistory.com/727)

- [https://eehoeskrap.tistory.com/752](https://eehoeskrap.tistory.com/752)

- **[[2102.12092] Zero-Shot Text-to-Image Generation](https://arxiv.org/abs/2102.12092)**

- **[[2204.06125] Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)**

- **[[2023.10] Improving Image Generation with Better Captions](https://cdn.openai.com/papers/dall-e-3.pdf)**

- [https://www.hani.co.kr/arti/economy/it/1128910.html](https://www.hani.co.kr/arti/economy/it/1128910.html)

- [https://process-mining.tistory.com/161](https://process-mining.tistory.com/161)

- [https://wikidocs.net/152474](https://wikidocs.net/152474)

- [https://littlefoxdiary.tistory.com/74](https://littlefoxdiary.tistory.com/74)
