Stable Diffusion is the dominant open-source image generation model — a latent diffusion model (LDM) that runs on consumer GPUs. SDXL, SD 3, and community fine-tunes like DreamBooth and LoRA have made it the most customisable image AI available.
Stable Diffusion is a latent diffusion model (LDM). The key insight: rather than doing diffusion in pixel space (computationally expensive), it does diffusion in a compressed latent space learned by a VAE (Variational Autoencoder). This makes SD orders of magnitude cheaper to run than earlier pixel-space diffusion models like GLIDE.
The pipeline has three components: (1) a VAE encoder that compresses images into latent space with 8× spatial compression (a 512×512 image becomes a 4×64×64 latent), (2) a U-Net denoiser trained to reverse the diffusion process in latent space, conditioned on text embeddings, and (3) a CLIP text encoder that converts prompts to embeddings.
At inference: start from Gaussian noise in latent space → denoise step by step using the U-Net (guided by the text embedding via cross-attention) → decode the final latent with the VAE decoder → pixel image. Typically 20–50 denoising steps.
```python
from diffusers import StableDiffusionPipeline
import torch

# SD 1.5 — the classic, smallest model (~2GB VRAM)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a serene japanese garden at sunset, photorealistic, 8k",
    negative_prompt="ugly, blurry, low quality, cartoon",
    num_inference_steps=30,  # quality vs speed trade-off
    guidance_scale=7.5,      # adherence to prompt (CFG scale)
    height=512, width=512,   # must be multiples of 8
).images[0]
image.save("output.png")
```
```bash
# Even simpler: ComfyUI (local, no-code node graph)
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt
python main.py --listen
# Open http://localhost:8188 in browser
```
SDXL (Stable Diffusion XL): Released mid-2023. Key improvements: 1024×1024 native resolution (vs 512×512 for SD 1.5), larger U-Net (2.6B vs 860M params), two text encoders (OpenCLIP + CLIP-L), and a refiner model for final detail pass. Needs ~6GB VRAM at fp16.
```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

image = pipe(
    prompt="a cat sitting on a windowsill in Monet's painting style",
    negative_prompt="blurry, ugly",
    num_inference_steps=30,
    guidance_scale=5.0,
    height=1024, width=1024,
).images[0]
```
SD 3 / SD 3.5: Released 2024. Uses a Multimodal Diffusion Transformer (MMDiT) instead of a U-Net. Better text rendering, more accurate prompt following, and improved composition. SD 3.5 Large (8B parameters) is the highest-quality open-weight model in the Stable Diffusion family.
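Usage through diffusers mirrors the earlier examples; a sketch assuming a diffusers version with SD3 support and access to the gated `stabilityai/stable-diffusion-3.5-large` weights on Hugging Face (step count and guidance scale here are typical starting values, not fixed requirements):

```python
from diffusers import StableDiffusion3Pipeline
import torch

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,  # bf16 is the usual precision for SD 3.5
).to("cuda")

# MMDiT handles legible in-image text much better than U-Net models
image = pipe(
    "a street sign that reads 'OPEN SOURCE', sharp photo",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
```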
LoRA fine-tuning lets you teach Stable Diffusion a new concept (person, style, object) with just 10–30 training images, in under an hour on a consumer GPU.
```bash
# Using kohya_ss trainer (most popular LoRA training tool)
# Step 1: prepare dataset — 10-30 images + caption text files
# Step 2: run training
python train_network.py \
  --pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
  --train_data_dir "./training_data" \
  --output_dir "./lora_output" \
  --network_module networks.lora \
  --network_dim 32 \
  --network_alpha 16 \
  --learning_rate 1e-4 \
  --max_train_steps 1000 \
  --save_every_n_steps 200
```
```python
# Load and use the trained LoRA
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./lora_output/trained_model.safetensors")
image = pipe("photo of john_doe sitting in a park, detailed, 4k").images[0]
```
ControlNet adds a conditioning signal to the diffusion process — you can provide a depth map, edge map, pose skeleton, or segmentation mask to control the spatial structure of the generated image while the content is determined by the text prompt.
```python
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image
import torch
import cv2
import numpy as np

# Canny edge ControlNet
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

# Generate Canny edges from reference image
image = cv2.imread("reference.jpg")
edges = cv2.Canny(image, 100, 200)
edges_pil = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel condition image

# Generate image following the edge structure
result = pipe(
    "a beautiful watercolor painting of a cityscape",
    image=edges_pil,
    controlnet_conditioning_scale=0.8,  # how strictly to follow the edges
).images[0]
```
guidance_scale (CFG): Classifier-Free Guidance scale. Higher = image follows prompt more strictly but may look oversaturated/unrealistic. 7–8 for photorealistic, 4–6 for creative. Values above 12 usually degrade quality.
num_inference_steps: Number of denoising steps. 20–30 is the sweet spot; more steps give diminishing returns and 20 is often indistinguishable from 50. With fast samplers (DPM++ 2M Karras), 20 steps is excellent.
scheduler: The ODE solver for the denoising process. DPM++ 2M Karras is the community favourite for speed+quality. DDIM is reliable. Euler Ancestral adds controlled randomness. Try a few for your use case.
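Swapping schedulers in diffusers is a one-line change; a sketch, assuming a pipeline object `pipe` has been loaded as in the examples above (`use_karras_sigmas=True` selects the Karras noise schedule that gives the "DPM++ 2M Karras" sampler):

```python
from diffusers import DPMSolverMultistepScheduler

# Replace the pipeline's default scheduler, keeping its config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
```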
negative prompt: Tokens to steer away from. Standard: "ugly, blurry, low quality, watermark, signature, text, bad anatomy, deformed". More impactful than most people expect.
seed: Set for reproducibility. Same seed + prompt + settings = same image.
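In diffusers the seed is supplied as a `torch.Generator` passed to the pipeline call. A minimal sketch of what this buys, using plain torch with no model involved (`sample_noise` is a hypothetical helper standing in for the pipeline's initial latent sampling):

```python
import torch

def sample_noise(seed: int) -> torch.Tensor:
    # A seeded generator produces identical initial latent noise,
    # which is why same seed + prompt + settings = same image.
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn((1, 4, 64, 64), generator=gen)  # SD 1.5 latent at 512x512

print(torch.equal(sample_noise(42), sample_noise(42)))  # same seed: identical
print(torch.equal(sample_noise(42), sample_noise(43)))  # different seed: different
```

In a real call this looks like `pipe(prompt, generator=torch.Generator("cuda").manual_seed(42))`.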
VRAM requirements vary by resolution: SD 1.5 at 512×512 needs ~2GB. SDXL at 1024×1024 needs ~6GB. Generating at 2048×2048 needs ~16GB+. For large images, use tiled diffusion or generate at base resolution and upscale.
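The scaling follows directly from the 8× spatial compression: the U-Net operates on a 4×(H/8)×(W/8) latent, so doubling the resolution quadruples the latent size, and attention cost grows even faster. A back-of-envelope sketch (hypothetical helper; this counts only latent elements, not weights or activations):

```python
def latent_shape(height: int, width: int) -> tuple:
    # Stable Diffusion's VAE compresses 8x per spatial dim into 4 channels
    return (4, height // 8, width // 8)

for size in (512, 1024, 2048):
    c, h, w = latent_shape(size, size)
    print(f"{size}x{size}: latent {c}x{h}x{w} = {c * h * w:,} elements")
# 512x512:   4x64x64   =  16,384 elements
# 1024x1024: 4x128x128 =  65,536 elements
# 2048x2048: 4x256x256 = 262,144 elements
```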
Safetensors vs .ckpt: Prefer safetensors format for downloaded models — .ckpt files can contain arbitrary Python code (pickle format) and could be malicious. Always use models from trusted sources and prefer safetensors.
NSFW filter: The default diffusers pipeline includes a safety checker that blacks out NSFW content. For artistic or research use cases, this can be disabled, but this should be done responsibly.
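In diffusers, disabling the checker is a constructor argument; a sketch using the SD 1.5 checkpoint from above (diffusers emits a warning when the checker is removed):

```python
from diffusers import StableDiffusionPipeline
import torch

# safety_checker=None skips the NSFW post-filter entirely
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
)
```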
Community models on Civitai: Civitai hosts thousands of fine-tuned SD models. Quality varies enormously. Look for models with high download counts, recent updates, and example images that match your target style.
The Stable Diffusion model family has evolved significantly since the original release, with each major version bringing architectural improvements, better prompt following, higher resolution capability, and reduced generation artifacts. Understanding the differences between variants helps select the right model for each image generation use case.
| Model | Architecture | Resolution | Prompt Following | Use Case |
|---|---|---|---|---|
| SD 1.5 | UNet + CLIP | 512×512 | Moderate | Widest fine-tune ecosystem |
| SD 2.1 | UNet + OpenCLIP | 768×768 | Better | Higher resolution needs |
| SDXL | 2× UNet ensemble | 1024×1024 | Strong | High-quality production |
| SD 3 | Diffusion Transformer | 1024×1024 | Very strong | Text rendering, accuracy |
| FLUX.1 | Flow matching + DiT | Any | Excellent | Highest quality open |
ControlNet extensions add spatial conditioning to the diffusion process, allowing generation to follow structural constraints like edge maps, depth maps, pose skeletons, and semantic segmentation masks. A ControlNet-conditioned generation starts from the same noise diffusion process but is guided at each denoising step to maintain the spatial structure of the condition image while applying the style and content specified in the text prompt. This enables applications like pose-consistent character generation, architecture visualization, and product placement that are impractical with text-only prompts.
LoRA fine-tuning for Stable Diffusion follows the same low-rank weight adaptation principle as LLM LoRA, but adapts the UNet's attention layers to capture a specific style, object, or person from a small set of training images (typically 10–30). A style LoRA trained on a painter's work can apply that artistic style to any text prompt; a subject LoRA trained on a person's photos can insert that person into generated scenes. LoRAs are small files (10–100MB) that can be freely combined with different base models and with each other, creating a rich ecosystem of reusable style and subject adapters.
Negative prompts in Stable Diffusion guide the diffusion process away from undesired visual elements: in classifier-free guidance, the negative prompt's embedding takes the place of the usual empty (unconditional) prompt. The denoising prediction becomes negative + guidance_scale × (positive − negative), so each step moves toward the positive prompt and, by the same margin, away from what the negative prompt describes. Common negative prompt patterns include "blurry, low quality, artifacts, deformed, watermark" to prevent common generation failure modes.
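The guidance arithmetic can be sketched in a few lines of numpy, treating the model's noise predictions as plain vectors (toy numbers, not real model outputs; standard CFG starts from the negative/unconditional prediction and extrapolates toward the positive one):

```python
import numpy as np

def cfg(noise_neg, noise_pos, guidance_scale):
    # Classifier-free guidance with a negative prompt:
    # extrapolate from the negative-prompt prediction
    # toward the positive-prompt prediction.
    return noise_neg + guidance_scale * (noise_pos - noise_neg)

noise_pos = np.array([1.0, 0.0])  # prediction conditioned on the prompt
noise_neg = np.array([0.0, 1.0])  # prediction conditioned on the negative prompt

print(cfg(noise_neg, noise_pos, 1.0))  # scale 1: exactly the positive prediction
print(cfg(noise_neg, noise_pos, 7.5))  # higher scale: pushed well past it
```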
Inpainting workflows use Stable Diffusion to regenerate specific regions of an existing image while preserving the rest. A binary mask identifies which pixels should be regenerated; the unmasked region is encoded into the latent space and held fixed during denoising while the masked region starts from noise. This enables product image editing (replacing backgrounds), photo retouching (removing objects), and iterative scene composition where specific elements are refined without regenerating the entire image from scratch.
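The masked-latent mechanic reduces to a per-step blend: keep the denoiser's proposal where the mask is 1, and re-impose the known image's latent elsewhere. A numpy sketch with a toy 1-D latent (hypothetical names, real pipelines also noise the known latent to match the current timestep):

```python
import numpy as np

def blend_step(proposed, known, mask):
    # mask == 1: regenerate this region from the denoiser's proposal
    # mask == 0: hold the existing image's latent fixed
    return mask * proposed + (1 - mask) * known

known    = np.array([5.0, 5.0, 5.0, 5.0])  # latent of the original image
proposed = np.array([9.0, 9.0, 9.0, 9.0])  # denoiser output this step
mask     = np.array([0.0, 1.0, 1.0, 0.0])  # regenerate only the middle

print(blend_step(proposed, known, mask))  # middle regenerated, edges preserved
```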
VAE (Variational Autoencoder) selection significantly affects fine detail and color accuracy in Stable Diffusion outputs. The VAE encodes images to the latent space for the diffusion process and decodes them back to pixel space after denoising. The VAE baked into SD 1.5 is known to produce slightly washed-out colors; swapping in an improved VAE fine-tuned for better color saturation and sharpness gives noticeably cleaner outputs without changing any other model weights. For SDXL, the official VAE is numerically unstable at FP16 and can produce NaN (black) outputs on some GPU configurations; the community "fp16-fix" variant is the standard drop-in replacement for half-precision inference.
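Swapping the VAE in diffusers looks like the following sketch, assuming the widely used community checkpoint `madebyollin/sdxl-vae-fp16-fix` (any `AutoencoderKL` checkpoint can be substituted the same way):

```python
from diffusers import AutoencoderKL, StableDiffusionXLPipeline
import torch

# Load a replacement VAE, then hand it to the pipeline constructor
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae, torch_dtype=torch.float16,
).to("cuda")
```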