Generative AI · Multimodal

Image Generation Models

Diffusion models, GANs, and autoregressive approaches — architecture, sampling, fine-tuning, and evaluation

3 main methods
8 sections
8 tools
Contents
  1. How diffusion works
  2. Architecture overview
  3. Sampling algorithms
  4. Fine-tuning techniques
  5. Quality metrics
  6. Tools & frameworks
  7. Practical API usage
  8. References
01 — Foundation

How Diffusion Works

Diffusion models learn to denoise. During training, noise is gradually added to real images until they become pure noise — the forward process. The model then learns to reverse this, predicting the noise at each step — the reverse process.

Forward Process (Training Setup): Start with a real image, add Gaussian noise step by step. After many steps (typically 1000), only noise remains. Reverse Process (Inference): Start with pure noise, predict and remove noise iteratively. Each prediction step brings you closer to a sample from the learned distribution.
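The forward process has a closed form: x_t can be sampled directly from x_0 without looping over intermediate steps. A minimal NumPy sketch, using the linear beta schedule from the DDPM paper (names and shapes here are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return xt, noise  # the model is trained to predict `noise` given (xt, t)

# Linear beta schedule as in DDPM (Ho et al., 2020)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))  # stand-in for an image or latent
xt, eps = forward_diffuse(x0, T - 1, alphas_cumprod, rng)
print(alphas_cumprod[-1] < 1e-3)  # True: at t = T-1 the signal is essentially gone
```

The cumulative product `alphas_cumprod` decays toward zero, which is why the final timestep is indistinguishable from pure Gaussian noise.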

Key Concepts

DDPM (Denoising Diffusion Probabilistic Models): The foundational approach from Ho et al. (2020). Predicts the noise at each timestep using a UNet conditioned on timestep embeddings.

Score Matching: Instead of predicting noise, predict the gradient (score) of the log probability. The two parameterizations differ only by a rescaling (the score is the predicted noise scaled by −1/σ_t), so they are mathematically equivalent, though score prediction is sometimes more stable.

💡 Why noise prediction works: Training the network to predict the added noise at each timestep optimizes a reweighted variational lower bound on the data log-likelihood, so the simple MSE-on-noise objective still trains a principled generative model of the original data distribution.

DDIM (Denoising Diffusion Implicit Models): Accelerates sampling by skipping timesteps. Instead of 1000 steps, use 50–100. Trade-off: slightly lower quality but 10–20× speedup.

02 — Structure

Architecture Overview

Image generation models combine several components: a VAE encoder/decoder for latent space, a UNet denoiser, and a text encoder (CLIP) for conditioning. Most modern models use latent diffusion — operating in compressed latent space rather than pixel space for efficiency.

Core Components

UNet Denoiser: The main architecture. Encoder-decoder with skip connections. Takes noisy latent + timestep embedding + optional text conditioning, outputs predicted noise.

VAE (Variational Autoencoder): Encodes images into low-dimensional latents. Diffusion happens in this compressed space (4–8× compression). Decodes latents back to pixel space at the end.

CLIP Text Encoder: Converts text prompts to embeddings. Cross-attention layers in UNet use these embeddings to guide generation toward the prompt.

Timestep Embedding: Injected into UNet to condition the denoiser on which step of the reverse process it is. Typically sinusoidal positional encodings.
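A minimal sketch of such a sinusoidal timestep embedding, assuming the common Transformer-style recipe (dimensionality and frequency base vary across implementations):

```python
import numpy as np

def timestep_embedding(t: int, dim: int = 128, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal encoding of a scalar diffusion timestep."""
    half = dim // 2
    # Geometric sequence of frequencies from 1 down to 1/max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500, dim=128)
print(emb.shape)  # (128,)
```

Each timestep maps to a distinct, smoothly varying vector, which the UNet typically passes through a small MLP before adding it into its residual blocks.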

⚠️ Latent vs Pixel-space: Stable Diffusion operates in VAE latent space (64×64 latents for 512×512 images). Much faster, but quality depends on a well-trained VAE. DALL-E's diffusion models, by contrast, operate in pixel space (with super-resolution stages): higher fidelity but much slower.
03 — Inference

Sampling Algorithms Comparison

Different schedulers control which timesteps to sample and how to update predictions. The choice dramatically affects speed vs quality.

| Algorithm  | Typical Steps | Speed     | Quality   | Stability   | Best For                      |
|------------|---------------|-----------|-----------|-------------|-------------------------------|
| DDPM       | 1000          | Slow      | Highest   | Very stable | Research, publication quality |
| DDIM       | 50–100        | Fast      | Good      | Stable      | General purpose, interactive  |
| DPM-Solver | 20–50         | Very fast | Excellent | Very stable | Production inference          |
| PLMS       | 25–50         | Very fast | Very good | Stable      | Fast generation               |
| Euler      | 30–100        | Fast      | Very good | Medium      | Creative exploration          |

DPM-Solver recommended: Matches DDPM quality in 20–30 steps. It exploits the semi-linear structure of the diffusion ODE, solving the linear part exactly and approximating only the neural-network term. Currently the best speed-quality tradeoff.
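The common idea behind fast samplers is to evaluate the model on only a strided subset of the 1000 training timesteps. A toy sketch of the timestep selection (the actual update rules in DDIM and DPM-Solver are more involved):

```python
def sampling_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 50):
    """Evenly strided subset of the training timesteps, in descending order."""
    stride = num_train_steps // num_inference_steps
    return list(range(num_train_steps - 1, -1, -stride))

ts = sampling_timesteps(1000, 50)
print(len(ts), ts[0], ts[-1])  # 50 999 19
```

In diffusers, the corresponding switch is a one-liner, e.g. `pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)`.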

04 — Customization

Fine-Tuning Techniques

Pre-trained models like Stable Diffusion can be adapted to new styles, objects, or concepts without retraining from scratch. Several lightweight approaches exist.

Popular Approaches

DreamBooth: Fine-tune on 3–5 images of a subject. Optimize a special token + UNet weights. Preserves general knowledge while learning the new concept. ~10 minutes on single GPU.

LoRA (Low-Rank Adaptation): Add trainable low-rank matrices to UNet layers. Dramatically reduces parameters (2–10%). Faster training, smaller checkpoints, composable.

Textual Inversion: Don't fine-tune the model. Instead, optimize a special text embedding that encodes a concept. Smallest checkpoint size, but less flexible.

Kohya Workflow: Community standard for LoRA training. YAML config-driven, supports gradient accumulation, mixed precision, multiple concepts.
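LoRA's footprint follows from a quick parameter count: a dense update of a d_out × d_in weight matrix is replaced by two rank-r factors, B (d_out × r) and A (r × d_in). A small sketch (the 768-wide projection is an illustrative SD-like size, not taken from any specific checkpoint):

```python
def lora_param_savings(d_in: int, d_out: int, rank: int):
    """Parameters for a full weight update vs a rank-r LoRA update W + B @ A."""
    full = d_in * d_out                 # dense delta-W
    lora = rank * (d_in + d_out)        # A: (r, d_in) plus B: (d_out, r)
    return full, lora, lora / full

full, lora, frac = lora_param_savings(768, 768, rank=8)
print(full, lora, f"{frac:.1%}")  # 589824 12288 2.1%
```

At rank 8 the update is about 2% of the full matrix, which is where the "2–10%" trainable-parameter figure above comes from.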

Example: LoRA Fine-Tuning with Diffusers

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
import torch

# Load base model
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet

# Apply LoRA to the UNet attention projections
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["to_k", "to_v", "to_q"],
    lora_dropout=0.05,
)
unet = get_peft_model(unet, lora_config)

# Fine-tune on a custom dataset (training_step computes the noise-prediction loss)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)
for batch in train_loader:
    images, prompts = batch
    loss = training_step(unet, images, prompts)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
💡 LoRA is portable: A LoRA checkpoint is only 10–50MB. Stack multiple LoRAs in inference for composability. Easier to distribute and version than full fine-tunes.
Python · ControlNet for layout-guided image generation
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import UniPCMultistepScheduler
import torch
from PIL import Image
import numpy as np

def generate_with_canny_control(
    prompt: str,
    control_image: Image.Image,  # edge-detected source image
    strength: float = 0.8
) -> Image.Image:
    """Generate image guided by edge map using ControlNet."""
    # Load ControlNet (Canny edge conditioning)
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny",
        torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        safety_checker=None
    )
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()  # handles device placement; don't also call .to("cuda")

    result = pipe(
        prompt=prompt,
        image=control_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=strength
    ).images[0]
    return result

# Typical use case: maintain object layout while changing style
# control_image = canny_edge_detect(source_photo)
# output = generate_with_canny_control(
#     "A Van Gogh painting of a city street, starry night style",
#     control_image=control_image
# )
05 — Evaluation

Quality Metrics

Image generation quality is hard to measure automatically. Several metrics exist; most correlate imperfectly with human perception.

| Metric | Meaning | Range | Direction | Pros | Cons |
|---|---|---|---|---|---|
| FID (Fréchet Inception Distance) | Distance between feature distributions of real vs generated images | 0–∞ | Lower is better | Fast, correlates with human judgment | Depends on Inception features, doesn't measure diversity |
| IS (Inception Score) | Sharpness + diversity of generated images | 1–1000 (bounded by class count) | Higher is better | Fast, no reference data needed | Biased toward Inception-like images, unstable |
| CLIP Score | Similarity between image and text prompt | 0–1 | Higher is better | Measures text alignment directly | CLIP has its own biases |
| LPIPS (Learned Perceptual Image Patch Similarity) | Perceptual distance using deep features | 0–1 | Lower is better | Better correlation with human perception | Computationally expensive |
| Human Evaluation | Direct rating or pairwise comparison | Variable | Higher is better | Ground truth | Expensive, time-consuming |

Best practice: Use FID for fast iterations, CLIP Score for prompt alignment, and human evaluation for final quality gates. Combine multiple metrics.
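To make the FID row concrete: the metric is the Fréchet distance between two Gaussians fit to Inception-v3 features of real and generated images. A NumPy-only sketch of that distance (real FID implementations also handle the feature extraction and extra numerical stabilization):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians (the core of FID)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) equals the sum of sqrt of eigenvalues of S1 @ S2,
    # which are real and non-negative for PSD covariance matrices
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean

mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))  # 0.0 for identical distributions
```

Identical feature distributions give distance 0; shifting either mean or covariance increases it, matching the "lower is better" direction in the table.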

06 — Ecosystem

Tools & Frameworks

Stable Diffusion (Model): Open-source latent diffusion. 1.5, XL, Turbo variants. Foundation for most fine-tuning.
DALL-E 3 (Model): Closed-source, pixel-space. Highest quality. Expensive API but excellent prompt understanding.
Midjourney (Model): Proprietary, Discord-based. Artistic quality, strong community. Not open-source.
Flux (Model): High-quality transformer-based diffusion. Faster than SDXL, competitive with DALL-E 3.
ComfyUI (Framework): Node-based UI for composable workflows. LoRA stacking, advanced samplers, extensible.
Kohya (Framework): LoRA training toolkit. YAML config, DreamBooth, LoRA merging. Community standard.
Diffusers (Library): Hugging Face library. Multi-model support, inference optimizations, LoRA integration.
InvokeAI (Framework): Web UI + backend. Model management, LoRA support, image upscaling integration.
07 — Practice

Practical API Usage & Prompt Engineering

Getting good results from text-to-image models requires understanding how prompt structure affects output. Effective image prompts describe subject, style, lighting, composition, and quality modifiers explicitly. Negative prompts (in models that support them) specify what to exclude — "blurry, low resolution, distorted faces" — and often improve quality as much as the positive prompt itself.
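A prompt-assembly helper along these lines keeps that structure explicit (the function name and default modifiers are illustrative, not from any library):

```python
def build_prompt(subject: str, style: str = "", lighting: str = "",
                 quality: str = "highly detailed, 8k"):
    """Assemble a structured positive prompt and a generic negative prompt."""
    parts = [subject, style, lighting, quality]
    positive = ", ".join(p for p in parts if p)
    negative = "blurry, low resolution, distorted faces, watermark"
    return positive, negative

pos, neg = build_prompt("a lighthouse on a cliff",
                        style="oil painting", lighting="golden hour")
print(pos)  # a lighthouse on a cliff, oil painting, golden hour, highly detailed, 8k
```

Keeping subject, style, lighting, and quality as separate fields makes it easy to vary one axis at a time when iterating on outputs.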

For production use cases, the key choices are: model (quality vs cost vs speed), resolution (512×512 for drafts, 1024×1024 for finals), steps (20–30 usually sufficient), guidance scale (7–12 for prompt adherence), and seed (fix for reproducibility). Always generate multiple candidates and filter — a rejection rate of 50–70% for final production use is normal.
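The generate-many-and-filter loop can be sketched as follows, with fixed seeds for reproducibility; the generator and scorer below are stubs standing in for a real pipeline call and a ranking metric such as CLIP score:

```python
import random

def generate_candidates(generate, score, prompt: str, n: int = 8,
                        keep_fraction: float = 0.3, base_seed: int = 42):
    """Generate n seeded candidates, score them, and keep the top fraction."""
    candidates = []
    for i in range(n):
        seed = base_seed + i          # fixed seeds => reproducible re-runs
        image = generate(prompt, seed=seed)
        candidates.append((score(image), seed, image))
    candidates.sort(reverse=True, key=lambda c: c[0])
    k = max(1, int(n * keep_fraction))
    return candidates[:k]

# Stubs to illustrate the control flow; replace with a real pipeline and scorer
fake_generate = lambda prompt, seed: f"img-{seed}"
fake_score = lambda image: random.Random(image).random()
best = generate_candidates(fake_generate, fake_score, "a red bicycle", n=10)
print(len(best))  # 3
```

Returning the seed alongside each kept image means any selected candidate can be regenerated exactly, which is useful for upscaling or variation passes later.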

Python · DALL-E 3 and Stable Diffusion pipeline comparison
from openai import OpenAI

client = OpenAI()

def dalle3_generate(prompt: str, size: str = "1024x1024",
                    quality: str = "standard") -> str:
    """Generate image with DALL-E 3, return URL."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,  # "standard" or "hd"
        n=1
    )
    return response.data[0].url

def stable_diffusion_generate(prompt: str, negative_prompt: str = "",
                               steps: int = 30, guidance: float = 7.5):
    """Generate with local SD via diffusers; returns a PIL image."""
    from diffusers import StableDiffusionPipeline
    import torch
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1",
        torch_dtype=torch.float16
    ).to("cuda")
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt or "blurry, low quality, distorted",
        num_inference_steps=steps,
        guidance_scale=guidance,
        width=768, height=768  # SD 2.1 is trained at 768×768
    ).images[0]
    return image

# DALL-E 3: best quality, simple API, higher cost
url = dalle3_generate(
    "A photorealistic 3D render of a neural network as a glowing blue web, "
    "dark background, dramatic lighting, 8K, hyperdetailed"
)
print(f"DALL-E 3 URL: {url}")
08 — Further Reading

References
