Black Forest Labs' flow-matching image generation model. FLUX.1[dev] and FLUX.1[schnell] rival or exceed DALL-E 3 while shipping open weights, built on a hybrid multimodal/single-stream diffusion transformer architecture.
Flux (Black Forest Labs, August 2024) introduces several architectural innovations over Stable Diffusion. It trains with flow matching instead of the DDPM denoising objective: a cleaner theoretical framework that learns a direct mapping from the noise distribution to the image distribution, enabling sampling in fewer steps. The backbone is a hybrid: a stack of multimodal diffusion transformer (MM-DiT) blocks that jointly process image and text tokens, followed by a stack of single-stream transformer blocks. Text conditioning uses the T5-XXL encoder (~4.7B parameters) alongside a CLIP-L encoder, giving much richer semantic understanding than the CLIP-only conditioning in earlier SD models. The transformer itself totals ~12B parameters.
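The key idea of an MM-DiT block, joint attention over the concatenated text and image token streams, can be sketched as a toy single-head attention (hypothetical shapes, nothing like Flux's real dimensions or its modulation/RoPE machinery):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(txt, img, Wq, Wk, Wv):
    """Toy MM-DiT-style joint attention: one attention over both modalities."""
    x = np.concatenate([txt, img], axis=0)          # (T+I, d) combined sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # every token attends to both modalities
    out = attn @ v
    return out[:txt.shape[0]], out[txt.shape[0]:]   # split back into text / image streams

rng = np.random.default_rng(0)
d = 16
txt = rng.normal(size=(4, d))  # 4 text tokens
img = rng.normal(size=(9, d))  # a 3x3 patch grid -> 9 image tokens
W = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
t_out, i_out = joint_attention(txt, img, *W)
print(t_out.shape, i_out.shape)  # (4, 16) (9, 16)
```

The point of the joint formulation is that text tokens are updated by image tokens and vice versa within the same attention, rather than image-only self-attention plus a separate cross-attention as in earlier SD U-Nets.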
```python
from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()  # saves VRAM; slower, but fits on a 16 GB GPU
# pipe.to("cuda")  # if you have 24+ GB of VRAM

image = pipe(
    prompt="A serene Japanese garden at sunrise, golden light through bamboo, photorealistic",
    guidance_scale=3.5,      # Flux uses much lower guidance than SD (3-4 is typical)
    num_inference_steps=50,  # 50 for [dev]; 4 for [schnell]
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(42),
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
```python
# FLUX.1[schnell] (timestep-distilled, 4-step):
pipe_schnell = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe_schnell.enable_sequential_cpu_offload()  # or .to("cuda") with enough VRAM

image = pipe_schnell(
    prompt="A cat sitting on a windowsill watching rain",
    num_inference_steps=4,
    guidance_scale=0.0,  # schnell is distilled without guidance; use 0
).images[0]
```
```bash
pip install diffusers accelerate peft

# LoRA fine-tuning via the diffusers example script
# (examples/dreambooth/train_dreambooth_lora_flux.py in the diffusers repo).
# First prepare your dataset: 10-30 images + captions.
python train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="./my_images" \
  --output_dir="./flux-lora" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500
```
```python
# Load the trained LoRA at inference time
from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("./flux-lora", weight_name="pytorch_lora_weights.safetensors")
pipe.to("cuda")

image = pipe("sks person in a space suit", num_inference_steps=50).images[0]
```
Memory requirements and local options: Flux does not run under Ollama, which serves text models only (as of 2025). For local image generation with a GUI, use ComfyUI, which supports Flux natively, including fp8-quantized checkpoints that fit in roughly 12 GB of VRAM.
The Flux family offers different quality-speed tradeoffs across its variants; the choice comes down to quality requirements and available compute. Flux.1-schnell is a timestep-distilled model that generates in 4 steps, trading some quality for much lower latency. Flux.1-dev is the guidance-distilled, non-commercial variant for research and development with full-quality generation. Flux.1-pro is the commercial variant with the highest quality, available only through Black Forest Labs' API.
| Variant | Steps | Quality | License | VRAM required |
|---|---|---|---|---|
| Flux.1-schnell | 4 | Good | Apache 2.0 | ~12GB (fp8) |
| Flux.1-dev | 20–50 | High | Non-commercial | ~24GB (bf16) |
| Flux.1-pro | 25+ | Highest | Commercial API | API only |
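The sampling defaults from the table can be packaged in a small helper so call sites don't hard-code per-variant magic numbers (a sketch; the checkpoint IDs are the ones used on the Hugging Face Hub, the values mirror the table above):

```python
# Per-variant sampling defaults for the open-weight Flux checkpoints.
FLUX_DEFAULTS = {
    "black-forest-labs/FLUX.1-schnell": {"num_inference_steps": 4, "guidance_scale": 0.0},
    "black-forest-labs/FLUX.1-dev": {"num_inference_steps": 50, "guidance_scale": 3.5},
}

def sampling_kwargs(model_id: str) -> dict:
    """Return recommended pipe(...) keyword arguments for a Flux checkpoint."""
    return dict(FLUX_DEFAULTS[model_id])  # copy so callers can tweak safely

print(sampling_kwargs("black-forest-labs/FLUX.1-schnell"))
```

Usage is then `image = pipe(prompt, **sampling_kwargs(model_id)).images[0]`, keeping the step count and guidance scale consistent with the variant actually loaded.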
Flux.1's rectified-flow formulation with multimodal diffusion transformer (MM-DiT) blocks produces higher-fidelity generation than earlier diffusion models, particularly for text rendering within images and precise spatial composition. The flow-matching training objective yields straighter sampling trajectories than DDPM-based models, so fewer denoising steps are needed for equivalent quality. These improvements, together with the open weights, explain Flux's emergence as a preferred foundation for fine-tuning and LoRA-based style transfer in the open-source community.
Flux implements rectified flow, a flow-matching formulation that straightens the path from noise to data. Traditional diffusion models follow curved paths from pure noise to images, requiring many denoising steps (20-50+). Rectified flow instead trains the model to predict a velocity, i.e. how to move along the noise-to-image path, rather than the noise added at each step, and the straight-line interpolation used during training makes the learned trajectory easier to integrate in few steps. This change of objective has practical consequences: combined with distillation, it enables high-quality generation in as few as 4 steps (FLUX.1[schnell]), roughly an order of magnitude faster than standard diffusion sampling. For practitioners, this means near-interactive generation on consumer GPUs, unlocking iterative editing workflows that were impractical with older diffusion models.
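The training objective described above fits in a few lines. This toy sketch uses random vectors in place of latent images and omits the network entirely; it only shows what the regression target is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rectified-flow / flow-matching objective on vectors instead of image latents.
x1 = rng.normal(loc=3.0, size=(64, 8))  # "data" samples
x0 = rng.normal(size=(64, 8))           # pure noise samples
t = rng.uniform(size=(64, 1))           # random timesteps in [0, 1]

x_t = (1 - t) * x0 + t * x1             # straight-line interpolation path
v_target = x1 - x0                      # constant velocity along that path

def flow_matching_loss(v_pred, v_target):
    """A model v_theta(x_t, t) is trained to regress the target velocity."""
    return np.mean((v_pred - v_target) ** 2)

# Sampling then integrates dx/dt = v_theta(x, t) from t=0 to t=1,
# which needs few Euler steps when the learned trajectory is nearly straight.
perfect = flow_matching_loss(v_target, v_target)
print(perfect)  # 0.0 for a perfect velocity predictor
```

Contrast with DDPM-style training, where the network regresses the added noise under a curved variance schedule; here the interpolant is linear and the target is a single constant vector per (x0, x1) pair.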
Flux supports several mechanisms for controlling generation. Guidance steers output toward prompt-aligned regions: FLUX.1[schnell] is distilled without guidance (use guidance_scale=0), while FLUX.1[dev] is guidance-distilled, so the guidance strength is fed to the model as an input rather than computed from a second unconditional forward pass as in classic classifier-free guidance (CFG). Beyond text prompts, structural conditioning is available through the FLUX.1 Tools family and ControlNet-style adapters: edge (Canny) and depth maps, inpainting and outpainting (Fill), and image variation (Redux). These pathways combine with a text prompt (e.g., "generate a portrait of a woman" constrained by a depth map), so results respect both semantic intent and structural constraints, improving prompt-to-image fidelity.
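For reference, classic classifier-free guidance, the mechanism that guidance-distilled models like FLUX.1[dev] bake into a single forward pass, is just a linear extrapolation between two predictions:

```python
import numpy as np

def apply_cfg(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Toy 2-component "predictions" standing in for full noise/velocity tensors.
u = np.array([0.0, 0.0])  # unconditional prediction
c = np.array([1.0, 2.0])  # text-conditional prediction

print(apply_cfg(u, c, 1.0))  # scale=1 recovers the plain conditional prediction
print(apply_cfg(u, c, 3.5))  # scale>1 amplifies the prompt-aligned direction
```

In an undistilled model this costs two forward passes per step (conditional and unconditional); guidance distillation trains the network to produce the already-guided output directly, which is why dev accepts a guidance_scale without doubling compute.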
Flux inference can be optimized through several techniques beyond the core algorithm. LoRA (Low-Rank Adaptation) fine-tunes Flux for specific visual styles or concepts with adapters on the order of tens to a few hundred MB instead of the full ~12B-parameter model, enabling fast switching and personalization. Quantization reduces weights from bfloat16 to fp8 or int8 (or 4-bit NF4), cutting memory footprint by 2-4x with modest quality loss. Batching amortizes per-request overhead, prefetching hides I/O latency, and caching intermediate activations across adjacent denoising steps (DeepCache-style) can skip redundant computation. Production deployments often combine these: a quantized base model on NVIDIA GPUs, swappable LoRA adapters for style personalization, and batched inference requests. Together these push latency from tens of seconds per image toward a few seconds, making near-real-time generation practical.
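The memory arithmetic behind quantization is easy to demonstrate with a minimal symmetric int8 scheme (a sketch of the idea only; real deployments use per-channel or block-wise scales via libraries such as bitsandbytes):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32 weights."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # a toy weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print(q.nbytes / w.nbytes)                 # 0.25 -> 4x memory reduction vs float32
print(float(np.abs(w - w_hat).max()) < s)  # True: error bounded by one quantization step
```

For a ~12B-parameter model the same ratio is what turns a ~24 GB bf16 checkpoint into one that fits alongside activations on a consumer GPU.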