Diffusion models, GANs, and autoregressive approaches — architecture, sampling, fine-tuning, and evaluation
Diffusion models learn to denoise. During training, noise is gradually added to real images until they become pure noise — the forward process. The model then learns to reverse this, predicting the noise at each step — the reverse process.
Forward Process (Training Setup): Start with a real image, add Gaussian noise step by step. After many steps (typically 1000), only noise remains. Reverse Process (Inference): Start with pure noise, predict and remove noise iteratively. Each prediction step brings you closer to a sample from the learned distribution.
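The forward process has a closed form, so any noised timestep can be sampled directly from the original image without iterating. A minimal NumPy sketch, using the linear beta schedule from the DDPM paper (values are the standard defaults, shown for illustration):

```python
import numpy as np

def make_schedule(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule and cumulative alpha products (DDPM defaults)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)  # abar_t = prod_{s<=t} (1 - beta_s)
    return betas, alpha_bar

def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
```

By the final step, `alpha_bar` is near zero, so the image term vanishes and only Gaussian noise remains, which is exactly why inference can start from pure noise.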
DDPM (Denoising Diffusion Probabilistic Models): The foundational approach from Ho et al. (2020). Predicts the noise at each timestep using a UNet conditioned on timestep embeddings.
Score Matching: Instead of predicting the noise, predict the gradient (score) of the log probability density. The score is the predicted noise scaled by a negative noise-level-dependent constant, so the two parameterizations are equivalent up to scaling, but the score form is sometimes more stable to train.
DDIM (Denoising Diffusion Implicit Models): Accelerates sampling by skipping timesteps. Instead of 1000 steps, use 50–100. Trade-off: slightly lower quality but 10–20× speedup.
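A single deterministic DDIM update (the eta = 0 case) is short enough to write out. Here `eps_pred` stands in for the UNet's noise prediction; the sketch shows why timesteps can be skipped, since the update only needs `alpha_bar` at its two endpoints:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0): jump from step t to any earlier step."""
    # Estimate the clean image from the current sample and predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate down to the earlier timestep's noise level
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because nothing in the update depends on visiting the intermediate steps, a 1000-step training schedule can be traversed in 50 jumps at inference time.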
Image generation models combine several components: a VAE encoder/decoder for latent space, a UNet denoiser, and a text encoder (CLIP) for conditioning. Most modern models use latent diffusion — operating in compressed latent space rather than pixel space for efficiency.
UNet Denoiser: The main architecture. Encoder-decoder with skip connections. Takes noisy latent + timestep embedding + optional text conditioning, outputs predicted noise.
VAE (Variational Autoencoder): Encodes images into low-dimensional latents. Diffusion happens in this compressed space: Stable Diffusion downsamples 8× spatially (512×512×3 pixels become 64×64×4 latents, roughly 48× fewer values). Decodes latents back to pixel space at the end.
CLIP Text Encoder: Converts text prompts to embeddings. Cross-attention layers in UNet use these embeddings to guide generation toward the prompt.
Timestep Embedding: Injected into UNet to condition the denoiser on which step of the reverse process it is. Typically sinusoidal positional encodings.
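The sinusoidal embedding is the same family used for Transformer positions; a NumPy sketch, where the dimension and frequency base are illustrative choices rather than fixed requirements:

```python
import numpy as np

def timestep_embedding(t: int, dim: int = 128, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep: geometric frequency ladder of cos/sin pairs."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # 1.0 down to ~1/max_period
    angles = t * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)])
```

Each timestep maps to a distinct, smoothly varying vector, which the UNet then projects and adds into its residual blocks so one network can denoise at every noise level.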
Different schedulers control which timesteps to sample and how to update predictions. The choice dramatically affects speed vs quality.
| Algorithm | Steps Typical | Speed | Quality | Stability | Best For |
|---|---|---|---|---|---|
| DDPM | 1000 | Slow | Highest | Very stable | Research, publication quality |
| DDIM | 50–100 | Fast | Good | Stable | General purpose, interactive |
| DPM-Solver | 20–50 | Very fast | Excellent | Very stable | Production inference |
| PLMS | 25–50 | Very fast | Very good | Stable | Fast generation |
| Euler | 30–100 | Fast | Very good | Medium | Creative exploration |
DPM-Solver recommended: Matches DDPM quality in 20–30 steps. It exploits the semi-linear structure of the diffusion ODE, solving the linear part exactly and approximating the remainder with a high-order solver. Currently the best speed-quality tradeoff.
Pre-trained models like Stable Diffusion can be adapted to new styles, objects, or concepts without retraining from scratch. Several lightweight approaches exist.
DreamBooth: Fine-tune on 3–5 images of a subject. Optimizes a rare identifier token plus the UNet weights, with a prior-preservation loss so general knowledge survives while the new concept is learned. Typically tens of minutes on a single GPU.
LoRA (Low-Rank Adaptation): Add trainable low-rank matrices alongside UNet layers. Trains only roughly 2–10% as many parameters as the full model. Faster training, smaller checkpoints, and multiple adapters are composable.
Textual Inversion: Don't fine-tune the model. Instead, optimize a special text embedding that encodes a concept. Smallest checkpoint size, but less flexible.
Kohya Workflow: Community standard for LoRA training. YAML config-driven, supports gradient accumulation, mixed precision, multiple concepts.
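The LoRA idea itself is compact enough to sketch directly: keep the pretrained weight frozen and train two low-rank factors whose scaled product is added to it. A pure-NumPy illustration (real trainings use libraries such as peft or the Kohya scripts; the class name and defaults here are made up for the example):

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer plus trainable low-rank update: y = x @ (W + (alpha/r) * A @ B)."""
    def __init__(self, W: np.ndarray, rank: int = 4, alpha: float = 4.0, seed: int = 0):
        d_in, d_out = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (d_in, rank))    # trainable down-projection
        self.B = np.zeros((rank, d_out))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W + self.scale * (x @ self.A @ self.B)
```

Zero-initializing B makes the adapter a no-op at the start of training, and the checkpoint only needs to store A and B: rank × (d_in + d_out) values instead of d_in × d_out, which is where the small file sizes come from.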
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import UniPCMultistepScheduler
import torch
from PIL import Image

def generate_with_canny_control(
    prompt: str,
    control_image: Image.Image,  # edge-detected source image
    strength: float = 0.8,
) -> Image.Image:
    """Generate an image guided by an edge map using ControlNet."""
    # Load ControlNet (Canny edge conditioning)
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny",
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        safety_checker=None,
    )
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
    # Note: cpu offload manages device placement itself, so don't also call .to("cuda")
    pipe.enable_model_cpu_offload()  # save VRAM by moving modules to GPU only when used
    result = pipe(
        prompt=prompt,
        image=control_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=strength,
    ).images[0]
    return result

# Typical use case: maintain object layout while changing style
# control_image = canny_edge_detect(source_photo)
# output = generate_with_canny_control(
#     "A Van Gogh painting of a city street, starry night style",
#     control_image=control_image,
# )
```
Image generation quality is hard to measure automatically. Several metrics exist; most correlate imperfectly with human perception.
| Metric | Meaning | Range | Higher/Lower | Pros | Cons |
|---|---|---|---|---|---|
| FID (Fréchet Inception Distance) | Distance between feature distributions of real vs generated | 0–∞ | Lower is better | Fast, correlates with human judgment | Depends on Inception features, doesn't measure diversity |
| IS (Inception Score) | Sharpness + diversity of generated images | 1 to number of classes (≈1000 for ImageNet Inception) | Higher is better | Fast, no reference data needed | Biased toward Inception-like images, unstable |
| CLIP Score | Cosine similarity between image and text embeddings | Roughly 0–1 (often reported ×100) | Higher is better | Measures text alignment directly | CLIP has its own biases |
| LPIPS (Learned Perceptual Image Patch Similarity) | Perceptual distance using deep features | 0–1 | Lower is better | Better correlation with human perception | Computationally expensive |
| Human Evaluation | Direct rating or pairwise comparison | Variable | Higher is better | Ground truth | Expensive, time-consuming |
Best practice: Use FID for fast iterations, CLIP Score for prompt alignment, and human evaluation for final quality gates. Combine multiple metrics.
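FID itself reduces to a closed-form distance between two Gaussians fitted to feature activations. A sketch of that final step (in practice the features come from an Inception-v3 network, which this omits; the trace term uses eigenvalues in place of an explicit matrix square root, which is equivalent for these positive semi-definite products):

```python
import numpy as np

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # tr(sqrtm(S1 @ S2)) equals the sum of square roots of the eigenvalues of S1 @ S2
    eig = np.linalg.eigvals(s1 @ s2)
    covmean_trace = np.sqrt(np.clip(eig.real, 0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * covmean_trace)
```

Identical feature distributions score near zero, and the value grows with any shift in mean or covariance, which is why FID is sensitive to both fidelity and mode collapse.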
Getting good results from text-to-image models requires understanding how prompt structure affects output. Effective image prompts describe subject, style, lighting, composition, and quality modifiers explicitly. Negative prompts (in models that support them) specify what to exclude — "blurry, low resolution, distorted faces" — and often improve quality as much as the positive prompt itself.
For production use cases, the key choices are: model (quality vs cost vs speed), resolution (512×512 for drafts, 1024×1024 for finals), steps (20–30 usually sufficient), guidance scale (7–12 for prompt adherence), and seed (fix for reproducibility). Always generate multiple candidates and filter — a rejection rate of 50–70% for final production use is normal.
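The prompt-structure advice above can be packaged as a small helper. The field names and default strings here are illustrative conventions, not an API from any library:

```python
def build_prompt(subject: str, style: str = "", lighting: str = "",
                 composition: str = "", quality: str = "highly detailed, sharp focus"):
    """Assemble a structured positive prompt plus a default negative prompt."""
    parts = [subject, style, lighting, composition, quality]
    positive = ", ".join(p for p in parts if p)  # skip empty fields
    negative = "blurry, low resolution, distorted faces, watermark"
    return positive, negative

pos, neg = build_prompt(
    subject="a red-brick lighthouse on a cliff",
    style="oil painting",
    lighting="golden hour",
)
```

Keeping the fields explicit makes it easy to vary one axis (say, style) across a batch of candidates while holding the rest of the prompt and the seed fixed.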
```python
from openai import OpenAI
from PIL import Image

client = OpenAI()

def dalle3_generate(prompt: str, size: str = "1024x1024",
                    quality: str = "standard") -> str:
    """Generate an image with DALL-E 3 and return its URL."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,  # "standard" or "hd"
        n=1,
    )
    return response.data[0].url

def stable_diffusion_generate(prompt: str, negative_prompt: str = "",
                              steps: int = 30, guidance: float = 7.5) -> Image.Image:
    """Generate with local Stable Diffusion via diffusers."""
    from diffusers import StableDiffusionPipeline
    import torch
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1",
        torch_dtype=torch.float16,
    ).to("cuda")
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt or "blurry, low quality, distorted",
        num_inference_steps=steps,
        guidance_scale=guidance,
        width=768, height=768,  # SD 2.1's native training resolution
    ).images[0]
    return image

# DALL-E 3: best quality, simple API, higher cost
url = dalle3_generate(
    "A photorealistic 3D render of a neural network as a glowing blue web, "
    "dark background, dramatic lighting, 8K, hyperdetailed"
)
print(f"DALL-E 3 URL: {url}")
```