Diffusion models, GANs, and autoregressive approaches — architecture, sampling, fine-tuning, and evaluation
Diffusion models learn to denoise. During training, noise is gradually added to real images until they become pure noise — the forward process. The model then learns to reverse this, predicting the noise at each step — the reverse process.
Forward Process (Training Setup): Start with a real image, add Gaussian noise step by step. After many steps (typically 1000), only noise remains. Reverse Process (Inference): Start with pure noise, predict and remove noise iteratively. Each prediction step brings you closer to a sample from the learned distribution.
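The forward process has a closed form, so any noised timestep can be sampled directly from the original image without iterating. A minimal NumPy sketch, using the linear beta schedule from the DDPM paper (values are the standard defaults, shown for illustration):

```python
import numpy as np

def make_schedule(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule and cumulative alpha products (DDPM defaults)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)  # abar_t = prod_{s<=t} (1 - beta_s)
    return betas, alpha_bar

def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
```

By the final step, `alpha_bar` is near zero, so the image term vanishes and only Gaussian noise remains, which is exactly why inference can start from pure noise.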
DDPM (Denoising Diffusion Probabilistic Models): The foundational approach from Ho et al. (2020). Predicts the noise at each timestep using a UNet conditioned on timestep embeddings.
Score Matching: Instead of predicting the noise, predict the gradient (score) of the log probability density. The score is the predicted noise scaled by a negative noise-level-dependent constant, so the two parameterizations are equivalent up to scaling, but the score form is sometimes more stable to train.
DDIM (Denoising Diffusion Implicit Models): Accelerates sampling by skipping timesteps. Instead of 1000 steps, use 50–100. Trade-off: slightly lower quality but 10–20× speedup.
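A single deterministic DDIM update (the eta = 0 case) is short enough to write out. Here `eps_pred` stands in for the UNet's noise prediction; the sketch shows why timesteps can be skipped, since the update only needs `alpha_bar` at its two endpoints:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0): jump from step t to any earlier step."""
    # Estimate the clean image from the current sample and predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate down to the earlier timestep's noise level
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because nothing in the update depends on visiting the intermediate steps, a 1000-step training schedule can be traversed in 50 jumps at inference time.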
Image generation models combine several components: a VAE encoder/decoder for latent space, a UNet denoiser, and a text encoder (CLIP) for conditioning. Most modern models use latent diffusion — operating in compressed latent space rather than pixel space for efficiency.
UNet Denoiser: The main architecture. Encoder-decoder with skip connections. Takes noisy latent + timestep embedding + optional text conditioning, outputs predicted noise.
VAE (Variational Autoencoder): Encodes images into low-dimensional latents. Diffusion happens in this compressed space: Stable Diffusion downsamples 8× spatially (512×512×3 pixels become 64×64×4 latents, roughly 48× fewer values). Decodes latents back to pixel space at the end.
CLIP Text Encoder: Converts text prompts to embeddings. Cross-attention layers in UNet use these embeddings to guide generation toward the prompt.
Timestep Embedding: Injected into UNet to condition the denoiser on which step of the reverse process it is. Typically sinusoidal positional encodings.
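The sinusoidal embedding is the same family used for Transformer positions; a NumPy sketch, where the dimension and frequency base are illustrative choices rather than fixed requirements:

```python
import numpy as np

def timestep_embedding(t: int, dim: int = 128, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep: geometric frequency ladder of cos/sin pairs."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # 1.0 down to ~1/max_period
    angles = t * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)])
```

Each timestep maps to a distinct, smoothly varying vector, which the UNet then projects and adds into its residual blocks so one network can denoise at every noise level.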
Different schedulers control which timesteps to sample and how to update predictions. The choice dramatically affects speed vs quality.
| Algorithm | Steps Typical | Speed | Quality | Stability | Best For |
|---|---|---|---|---|---|
| DDPM | 1000 | Slow | Highest | Very stable | Research, publication quality |
| DDIM | 50–100 | Fast | Good | Stable | General purpose, interactive |
| DPM-Solver | 20–50 | Very fast | Excellent | Very stable | Production inference |
| PLMS | 25–50 | Very fast | Very good | Stable | Fast generation |
| Euler | 30–100 | Fast | Very good | Medium | Creative exploration |
DPM-Solver recommended: Matches DDPM quality in 20–30 steps. It exploits the semi-linear structure of the diffusion ODE, solving the linear part exactly and approximating the remainder with a high-order solver. Currently the best speed-quality tradeoff.
Pre-trained models like Stable Diffusion can be adapted to new styles, objects, or concepts without retraining from scratch. Several lightweight approaches exist.
DreamBooth: Fine-tune on 3–5 images of a subject. Optimizes a rare identifier token plus the UNet weights, with a prior-preservation loss so general knowledge survives while the new concept is learned. Typically tens of minutes on a single GPU.
LoRA (Low-Rank Adaptation): Add trainable low-rank matrices alongside UNet layers. Trains only roughly 2–10% as many parameters as the full model. Faster training, smaller checkpoints, and multiple adapters are composable.
Textual Inversion: Don't fine-tune the model. Instead, optimize a special text embedding that encodes a concept. Smallest checkpoint size, but less flexible.
Kohya Workflow: Community standard for LoRA training. YAML config-driven, supports gradient accumulation, mixed precision, multiple concepts.
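The LoRA idea itself is compact enough to sketch directly: keep the pretrained weight frozen and train two low-rank factors whose scaled product is added to it. A pure-NumPy illustration (real trainings use libraries such as peft or the Kohya scripts; the class name and defaults here are made up for the example):

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer plus trainable low-rank update: y = x @ (W + (alpha/r) * A @ B)."""
    def __init__(self, W: np.ndarray, rank: int = 4, alpha: float = 4.0, seed: int = 0):
        d_in, d_out = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (d_in, rank))    # trainable down-projection
        self.B = np.zeros((rank, d_out))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W + self.scale * (x @ self.A @ self.B)
```

Zero-initializing B makes the adapter a no-op at the start of training, and the checkpoint only needs to store A and B: rank × (d_in + d_out) values instead of d_in × d_out, which is where the small file sizes come from.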
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import UniPCMultistepScheduler
import torch
from PIL import Image

def generate_with_canny_control(
    prompt: str,
    control_image: Image.Image,  # edge-detected source image
    strength: float = 0.8,
) -> Image.Image:
    """Generate an image guided by an edge map using ControlNet."""
    # Load ControlNet (Canny edge conditioning)
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny",
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        safety_checker=None,
    )
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
    # Note: cpu offload manages device placement itself, so don't also call .to("cuda")
    pipe.enable_model_cpu_offload()  # save VRAM by moving modules to GPU only when used
    result = pipe(
        prompt=prompt,
        image=control_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=strength,
    ).images[0]
    return result

# Typical use case: maintain object layout while changing style
# control_image = canny_edge_detect(source_photo)
# output = generate_with_canny_control(
#     "A Van Gogh painting of a city street, starry night style",
#     control_image=control_image,
# )
```
Image generation quality is hard to measure automatically. Several metrics exist; most correlate imperfectly with human perception.
| Metric | Meaning | Range | Higher/Lower | Pros | Cons |
|---|---|---|---|---|---|
| FID (Fréchet Inception Distance) | Distance between feature distributions of real vs generated | 0–∞ | Lower is better | Fast, correlates with human judgment | Depends on Inception features, doesn't measure diversity |
| IS (Inception Score) | Sharpness + diversity of generated images | 1 to number of classes (≈1000 for ImageNet Inception) | Higher is better | Fast, no reference data needed | Biased toward Inception-like images, unstable |
| CLIP Score | Cosine similarity between image and text embeddings | Roughly 0–1 (often reported ×100) | Higher is better | Measures text alignment directly | CLIP has its own biases |
| LPIPS (Learned Perceptual Image Patch Similarity) | Perceptual distance using deep features | 0–1 | Lower is better | Better correlation with human perception | Computationally expensive |
| Human Evaluation | Direct rating or pairwise comparison | Variable | Higher is better | Ground truth | Expensive, time-consuming |
Best practice: Use FID for fast iterations, CLIP Score for prompt alignment, and human evaluation for final quality gates. Combine multiple metrics.
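FID itself reduces to a closed-form distance between two Gaussians fitted to feature activations. A sketch of that final step (in practice the features come from an Inception-v3 network, which this omits; the trace term uses eigenvalues in place of an explicit matrix square root, which is equivalent for these positive semi-definite products):

```python
import numpy as np

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # tr(sqrtm(S1 @ S2)) equals the sum of square roots of the eigenvalues of S1 @ S2
    eig = np.linalg.eigvals(s1 @ s2)
    covmean_trace = np.sqrt(np.clip(eig.real, 0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * covmean_trace)
```

Identical feature distributions score near zero, and the value grows with any shift in mean or covariance, which is why FID is sensitive to both fidelity and mode collapse.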
Getting good results from text-to-image models requires understanding how prompt structure affects output. Effective image prompts describe subject, style, lighting, composition, and quality modifiers explicitly. Negative prompts (in models that support them) specify what to exclude — "blurry, low resolution, distorted faces" — and often improve quality as much as the positive prompt itself.
For production use cases, the key choices are: model (quality vs cost vs speed), resolution (512×512 for drafts, 1024×1024 for finals), steps (20–30 usually sufficient), guidance scale (7–12 for prompt adherence), and seed (fix for reproducibility). Always generate multiple candidates and filter — a rejection rate of 50–70% for final production use is normal.
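The prompt-structure advice above can be packaged as a small helper. The field names and default strings here are illustrative conventions, not an API from any library:

```python
def build_prompt(subject: str, style: str = "", lighting: str = "",
                 composition: str = "", quality: str = "highly detailed, sharp focus"):
    """Assemble a structured positive prompt plus a default negative prompt."""
    parts = [subject, style, lighting, composition, quality]
    positive = ", ".join(p for p in parts if p)  # skip empty fields
    negative = "blurry, low resolution, distorted faces, watermark"
    return positive, negative

pos, neg = build_prompt(
    subject="a red-brick lighthouse on a cliff",
    style="oil painting",
    lighting="golden hour",
)
```

Keeping the fields explicit makes it easy to vary one axis (say, style) across a batch of candidates while holding the rest of the prompt and the seed fixed.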
```python
from openai import OpenAI
from PIL import Image

client = OpenAI()

def dalle3_generate(prompt: str, size: str = "1024x1024",
                    quality: str = "standard") -> str:
    """Generate an image with DALL-E 3 and return its URL."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,  # "standard" or "hd"
        n=1,
    )
    return response.data[0].url

def stable_diffusion_generate(prompt: str, negative_prompt: str = "",
                              steps: int = 30, guidance: float = 7.5) -> Image.Image:
    """Generate with local Stable Diffusion via diffusers."""
    from diffusers import StableDiffusionPipeline
    import torch
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1",
        torch_dtype=torch.float16,
    ).to("cuda")
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt or "blurry, low quality, distorted",
        num_inference_steps=steps,
        guidance_scale=guidance,
        width=768, height=768,  # SD 2.1's native training resolution
    ).images[0]
    return image

# DALL-E 3: best quality, simple API, higher cost
url = dalle3_generate(
    "A photorealistic 3D render of a neural network as a glowing blue web, "
    "dark background, dramatic lighting, 8K, hyperdetailed"
)
print(f"DALL-E 3 URL: {url}")
```