GLM-Image Is Now Open Source

Author: Jaenoo
Created: Jan 14, 2026 3:25 PM
Categories: Vision, Python
Status: Done
Assignee: Jaenoo

0) What GLM-Image is good at (context)

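GLM-Image uses a hybrid architecture: an autoregressive (AR) generator plus a diffusion decoder. Beyond "pretty pictures", it aims to be strong on images where text, layout, and semantic structure matter, such as posters, PPT slides, infographics, and diagrams. (docs.z.ai)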

1) Setup

Option A) Local inference (Transformers + Diffusers)

The GLM-Image GitHub Quick Start instructs you to install transformers and diffusers from source. (GitHub)

python -m venv .venv
source .venv/bin/activate   # (Windows) .venv\Scripts\activate

pip install --upgrade pip
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git

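Checks before running locally (realistic requirements)

• Per the GitHub README, runtime cost is currently high, and it states that "a single GPU with 80GB or more, or multiple GPUs" is required. (GitHub)
→ If GPU capacity is tight in a personal environment, the API route (Option C) is recommended.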
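As a rough aid for that decision, the sketch below maps per-GPU memory sizes to a route. The function names (`pick_route`, `detect_gpu_mem_gb`) and the route labels are hypothetical. The rule that any machine with two or more GPUs counts as "multi-GPU" is also an assumption, since the quoted requirement does not say how much memory each GPU needs. Memory is measured in decimal GB because an "80GB" card reports slightly less than 80 GiB.

```python
def pick_route(gpu_mem_gb: list[float], min_single_gb: float = 80.0) -> str:
    """Return "local-single", "local-multi", or "api" for per-GPU memory sizes in GB."""
    if any(m >= min_single_gb for m in gpu_mem_gb):
        return "local-single"
    if len(gpu_mem_gb) >= 2:
        # ASSUMPTION: any multi-GPU machine is worth trying locally
        return "local-multi"
    return "api"


def detect_gpu_mem_gb() -> list[float]:
    """Read per-GPU memory via PyTorch; returns [] if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / 1e9
        for i in range(torch.cuda.device_count())
    ]
```

Calling `pick_route(detect_gpu_mem_gb())` on the target machine gives a first guess; a machine without CUDA falls through to "api".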

Option B) Local serving with SGLang

The repo also provides an example of image generation/editing endpoints based on sglang serve. (GitHub)

pip install "sglang[diffusion] @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git

sglang serve --model-path zai-org/GLM-Image

Option C) Quick PoC via the API (the most realistic flow for individuals)

The Z.ai Image API calls the glm-image model via POST /paas/v4/images/generations. (docs.z.ai)

The GLM-Image guide also lists the price as $0.015 / image. (docs.z.ai)

2) Inference

2.1 Text-to-Image (T2I): local pipeline

The GlmImagePipeline example in the GitHub README has the following form. (GitHub)

import torch
from diffusers.pipelines.glm_image import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "A PPT slide with clear hierarchy. Title: \"Quarterly Growth\" ..."

image = pipe(
    prompt=prompt,
    height=32 * 32,
    width=36 * 32,
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_t2i.png")

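Resolution rules (very important)

• GitHub states that width/height must be multiples of 32; otherwise an error is raised. (GitHub)

• Per the Z.ai docs, custom resolutions must fall in the 512–2048px range, and each side must be a multiple of 32. (docs.z.ai)

• The docs also list recommended resolutions (e.g. 1280×1280, 1568×1056). (docs.z.ai)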
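The rules above are easy to check before calling the model. This is a minimal sketch; the helper names `is_valid_size` and `snap_size` are hypothetical. The 512–2048 range comes from the Z.ai docs, so applying it to the local pipeline as well is an assumption.

```python
def is_valid_size(width: int, height: int, multiple: int = 32,
                  lo: int = 512, hi: int = 2048) -> bool:
    """True if both sides are in [lo, hi] and divisible by `multiple`."""
    return all(lo <= v <= hi and v % multiple == 0 for v in (width, height))


def snap_size(width: int, height: int, multiple: int = 32,
              lo: int = 512, hi: int = 2048) -> tuple[int, int]:
    """Clamp each side into [lo, hi], then round it to the nearest multiple."""
    def snap(v: int) -> int:
        v = max(lo, min(hi, v))
        # Integer rounding to the nearest multiple; lo and hi are multiples
        # of 32 themselves, so the result stays inside the range.
        return (v + multiple // 2) // multiple * multiple
    return snap(width), snap(height)
```

For example, `snap_size(1000, 1080)` gives `(992, 1088)`, and the documented `1568×1056` passes `is_valid_size` unchanged.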

2.2 Image-to-Image (I2I): editing / style change

As in the README example, pass the conditioning image(s) as image=[...]. (GitHub)

import torch
from PIL import Image
from diffusers.pipelines.glm_image import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")

cond = Image.open("cond.jpg").convert("RGB")
prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."

out = pipe(
    prompt=prompt,
    image=[cond],
    height=33 * 32,   # must be set even if it matches the input image
    width=32 * 32,    # must be set even if it matches the input image
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

out.save("output_i2i.png")

2.3 API call (cURL)

The example in the docs looks like this. (docs.z.ai)

curl --request POST \
  --url https://api.z.ai/api/paas/v4/images/generations \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "glm-image",
    "prompt": "A poster with clear multi-line text. Title: \"GLM-Image\" ...",
    "size": "1280x1280"
  }'

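• The API response returns the image as a url (so the image has to be downloaded separately); the GLM-Image guide states this as well. (docs.z.ai)

• The API has a quality option with hd/standard values. The default for glm-image is hd, and the docs describe the difference in time (about 20 seconds vs 5–10 seconds). (docs.z.ai)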
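The same call can be sketched in Python with only the standard library. The endpoint, model name, and `size` field are taken from the cURL example above; the helper names are hypothetical, and the response shape `{"data": [{"url": ...}]}` is an assumption that should be checked against the API reference.

```python
import json
import urllib.request

API_URL = "https://api.z.ai/api/paas/v4/images/generations"


def build_request(token: str, prompt: str, size: str = "1280x1280") -> urllib.request.Request:
    """Build the POST request that mirrors the cURL example."""
    body = json.dumps({"model": "glm-image", "prompt": prompt, "size": size})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_image_url(response_json: dict) -> str:
    # ASSUMPTION: the image URL sits at data[0].url in the response body
    return response_json["data"][0]["url"]


def generate_and_download(token: str, prompt: str, out_path: str,
                          size: str = "1280x1280") -> None:
    """Call the API, then download the returned URL to out_path."""
    with urllib.request.urlopen(build_request(token, prompt, size), timeout=120) as resp:
        payload = json.load(resp)
    urllib.request.urlretrieve(first_image_url(payload), out_path)
```

Because the response carries only a URL, the download step is part of the helper rather than left to the caller.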

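3) Prompt Recipes (prompt patterns specialized for document-style images)

3.1 Golden rules (the keys to a higher text/layout success rate)

Specify every visible string in quotes as "exact text"
GLM-Image's prompt utility (system prompt) also strongly requires that, if the image contains text, all of the text be given in full and clearly marked with quotation marks. (GitHub)

Layout first, text second (layout-first prompting)
Fixing the regions first, such as "title at the top / summary on the left / image on the right / table at the bottom", makes it easier for the model to establish the structure. (This fits GLM-Image's focus on knowledge-intensive structured generation such as posters, PPT slides, and diagrams.) (docs.z.ai)

State "No extra text" explicitly
In document-style generation, unwanted new text commonly appears. Add "Do not add any additional text beyond the quoted text." to the prompt.

Prompt enhancement
The GitHub README recommends enhancing the prompt with GLM-4.7 for higher quality. (GitHub)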
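The first three rules can be applied mechanically. The sketch below assembles a prompt region by region, quotes every visible string, and appends the "no extra text" sentence. It also pulls the quoted strings back out, so the same list can serve as the expected text for a later check. `build_prompt` and `quoted_texts` are hypothetical helper names, not part of GLM-Image.

```python
import re

NO_EXTRA = "Do not add any additional text beyond the quoted text."


def build_prompt(intro: str, regions: list[tuple[str, list[str]]]) -> str:
    """Layout first: one line per region, each visible string in double quotes."""
    lines = [intro]
    for region, texts in regions:
        quoted = " ".join(f'"{t}"' for t in texts)
        lines.append(f"{region}: {quoted}")
    lines.append(NO_EXTRA)
    return "\n".join(lines)


def quoted_texts(prompt: str) -> list[str]:
    """Return every double-quoted string in the prompt, in order."""
    return re.findall(r'"([^"]+)"', prompt)


prompt = build_prompt(
    "Create a 16:9 PPT slide with a professional business style.",
    [("Header", ["Project Update"]), ("Footer", ["Internal draft"])],
)
```

Here `quoted_texts(prompt)` returns `["Project Update", "Internal draft"]`. Strings that themselves contain double quotes are not handled by this simple pattern.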

3.2 Four templates

(A) Poster template (text-heavy)

Design a modern promotional poster with a clean grid layout and strong visual hierarchy.

Top area: big bold title text: "GLM-Image Open Source"
Below title: subtitle text: "Text + Layout + Meaning preserved"
Center: a simple abstract illustration (minimal, not distracting)
Bottom area: three bullet-style blocks with icons:
"Posters with readable text"
"PPT-style slides"
"Logical infographics & diagrams"

Typography: sans-serif, high contrast, crisp edges.
Do not add any additional text beyond the quoted text.

(B) PPT single-slide template (1-page executive slide)

Create a 16:9 PPT slide with a professional business style.

Header: title text "Project Update"
Left column: section title "Key Metrics" and three lines:
"DAU: 1.2M"
"Retention: 38%"
"Conversion: 4.1%"

Right column: a simple bar chart illustration (no random labels).
Footer: small disclaimer text "Internal draft"

All visible text must match exactly and be enclosed in quotes. No extra text.

(C) Infographic template (step-by-step)

Create an infographic with 4 numbered steps in a vertical flow.
Each step is a rounded rectangle connected by arrows.

Step titles:
"Step 1: Collect"
"Step 2: Clean"
"Step 3: Train"
"Step 4: Evaluate"

Add a small caption under each step (one short sentence).
Use a minimal flat design with clear spacing.
All text must be exact and only the quoted text should appear.

(D) Diagram template (box-arrow logic)

Draw a clean system diagram on a white background.

Three modules as boxes:
"Input" -> "AR Generator" -> "Diffusion Decoder"

Add small labels near arrows:
"tokens"
"latent refinement"

Add a legend box in bottom-right:
"AR: global structure"
"Decoder: details & text strokes"

No additional text beyond the quoted text.

4) OCR Validation Pipeline (generate → OCR → automatic scoring → retry)

When exact text is the goal for document-style images, OCR-based automatic verification makes the work markedly easier.

4.1 Pipeline concept (Design)

Generate N candidates (different seeds)

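Extract the text with OCR

Normalize (policy for case, whitespace, and special characters)

Score against the expected text with string similarity / exact match

If the score is below the threshold, retry (change the seed, tighten prompt constraints, shorten the text)

Note: the docs mention that GLM-Image itself is designed with text accuracy in mind (improving text strokes and the "forgetting characters" phenomenon). (docs.z.ai)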

4.2 Python example (local inference + PaddleOCR)

Below is a minimal example of "generate → OCR → score → save the single best image". (This is the form most commonly used in personal projects.)

import re
import numpy as np
import torch
from difflib import SequenceMatcher
from diffusers.pipelines.glm_image import GlmImagePipeline
from paddleocr import PaddleOCR

# 1) Expected text (must match the quoted text in the prompt exactly)
EXPECTED = [
    "GLM-Image Open Source",
    "Text + Layout + Meaning preserved",
    "Posters with readable text",
    "PPT-style slides",
    "Logical infographics & diagrams",
]

def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so trivial OCR differences are ignored
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()
    return s

def score_ocr(lines: list[str], expected_list: list[str]) -> float:
    # Average score of how well each expected string is contained / matched
    lines_n = [normalize(x) for x in lines]
    extracted_n = " ".join(lines_n)
    scores = []
    for t in expected_list:
        t_n = normalize(t)
        # Mix of containment and similarity (simple version)
        contain = 1.0 if t_n in extracted_n else 0.0
        # Compare with each OCR line and keep the best match; a short expected
        # string compared with the whole OCR text would always score low
        sim = max((SequenceMatcher(None, t_n, x).ratio() for x in lines_n), default=0.0)
        scores.append(0.7 * contain + 0.3 * sim)
    return sum(scores) / len(scores)

# 2) OCR engine (PaddleOCR 2.x call style)
ocr = PaddleOCR(use_angle_cls=True, lang="en")  # for Korean text, consider lang="korean"

# 3) Local GLM-Image pipeline
pipe = GlmImagePipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = """
Design a modern promotional poster with a clean grid layout and strong visual hierarchy.
Top area: big bold title text: "GLM-Image Open Source"
Below title: subtitle text: "Text + Layout + Meaning preserved"
Bottom area: three blocks with icons:
"Posters with readable text"
"PPT-style slides"
"Logical infographics & diagrams"
Typography: sans-serif, high contrast, crisp edges.
Do not add any additional text beyond the quoted text.
""".strip()

best = {"score": -1, "seed": None, "image": None}

for seed in [1, 2, 3, 4, 5]:
    img = pipe(
        prompt=prompt,
        width=1280,
        height=1280,
        num_inference_steps=50,
        guidance_scale=1.5,
        generator=torch.Generator(device="cuda").manual_seed(seed),
    ).images[0]

    # 4) Run OCR; PaddleOCR takes a file path or ndarray, so convert the PIL image
    ocr_result = ocr.ocr(np.array(img.convert("RGB")), cls=True)
    # A page with no detected text comes back as None, so skip empty blocks
    lines = [line[1][0] for block in (ocr_result or []) if block for line in block]

    # 5) Score
    s = score_ocr(lines, EXPECTED)
    if s > best["score"]:
        best.update({"score": s, "seed": seed, "image": img})

print("BEST:", best["score"], "seed:", best["seed"])
best["image"].save("best_poster.png")

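Retry tuning tips (when the OCR score is low)

• Shorten the text: the longer the multi-line text, the higher the chance of typos

• Enforce "No extra text": keeps unwanted characters out

• Split the text into blocks and make positions explicit, e.g. "Bottom area / Right column"

• A seed sweep is the cheapest and most effective improvement

• With the API instead of local inference, each image costs money, but in a personal environment this is often more practical than a GPU. (Price: $0.015/image) (docs.z.ai)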
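To put the API price in perspective for seed sweeps, here is a small arithmetic sketch. The price constant is the one quoted above; the function name and the idea of counting "retry rounds" are illustrative assumptions.

```python
PRICE_PER_IMAGE_USD = 0.015  # price quoted from the GLM-Image guide


def sweep_cost(num_prompts: int, seeds_per_prompt: int, retry_rounds: int = 1) -> float:
    """Total API cost in USD when every candidate image is billed once."""
    images = num_prompts * seeds_per_prompt * retry_rounds
    # Round to 4 decimals to hide floating-point noise in the product
    return round(images * PRICE_PER_IMAGE_USD, 4)
```

A 5-seed sweep of one prompt costs about $0.075, and two rounds of 5 seeds across the four templates come to about $0.60.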

5) Troubleshooting (common failure points)

• Resolution error: first check that width/height are multiples of 32. (GitHub)

• Out of VRAM / slow: the model is heavy enough that GitHub mentions "80GB+ or multi-GPU". For individuals, going through the API is often the cleaner route. (GitHub)

• Quality vs speed: the API distinguishes quality=hd/standard; hd takes longer but aims for richer results. (docs.z.ai)
