Black Forest Labs' flow-matching image generation model. FLUX.1[dev] and FLUX.1[schnell] rival or exceed DALL-E 3 while shipping open weights, built on a hybrid multimodal/single-stream diffusion transformer architecture.
Flux (Black Forest Labs, August 2024) introduces several architectural innovations over Stable Diffusion. It trains with flow matching instead of the DDPM denoising objective: a cleaner theoretical framework that learns a direct mapping from the noise distribution to the image distribution, enabling sampling in fewer steps. The backbone is a hybrid: a stack of multimodal diffusion transformer (MM-DiT) blocks that jointly process image and text tokens, followed by a stack of single-stream transformer blocks. Text conditioning uses the T5-XXL encoder (~4.7B parameters) alongside a CLIP-L encoder, giving much richer semantic understanding than the CLIP-only conditioning in earlier SD models. The transformer itself totals ~12B parameters.
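The key idea of an MM-DiT block, joint attention over the concatenated text and image token streams, can be sketched as a toy single-head attention (hypothetical shapes, nothing like Flux's real dimensions or its modulation/RoPE machinery):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(txt, img, Wq, Wk, Wv):
    """Toy MM-DiT-style joint attention: one attention over both modalities."""
    x = np.concatenate([txt, img], axis=0)          # (T+I, d) combined sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # every token attends to both modalities
    out = attn @ v
    return out[:txt.shape[0]], out[txt.shape[0]:]   # split back into text / image streams

rng = np.random.default_rng(0)
d = 16
txt = rng.normal(size=(4, d))  # 4 text tokens
img = rng.normal(size=(9, d))  # a 3x3 patch grid -> 9 image tokens
W = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
t_out, i_out = joint_attention(txt, img, *W)
print(t_out.shape, i_out.shape)  # (4, 16) (9, 16)
```

The point of the joint formulation is that text tokens are updated by image tokens and vice versa within the same attention, rather than image-only self-attention plus a separate cross-attention as in earlier SD U-Nets.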
```python
from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()  # saves VRAM; slower, but fits on a 16 GB GPU
# pipe.to("cuda")  # if you have 24+ GB of VRAM

image = pipe(
    prompt="A serene Japanese garden at sunrise, golden light through bamboo, photorealistic",
    guidance_scale=3.5,      # Flux uses much lower guidance than SD (3-4 is typical)
    num_inference_steps=50,  # 50 for [dev]; 4 for [schnell]
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(42),
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
```python
# FLUX.1[schnell] (timestep-distilled, 4-step):
pipe_schnell = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe_schnell.enable_sequential_cpu_offload()  # or .to("cuda") with enough VRAM

image = pipe_schnell(
    prompt="A cat sitting on a windowsill watching rain",
    num_inference_steps=4,
    guidance_scale=0.0,  # schnell is distilled without guidance; use 0
).images[0]
```
```bash
pip install diffusers accelerate peft

# LoRA fine-tuning via the diffusers example script
# (examples/dreambooth/train_dreambooth_lora_flux.py in the diffusers repo).
# First prepare your dataset: 10-30 images + captions.
python train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="./my_images" \
  --output_dir="./flux-lora" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500
```
```python
# Load the trained LoRA at inference time
from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("./flux-lora", weight_name="pytorch_lora_weights.safetensors")
pipe.to("cuda")

image = pipe("sks person in a space suit", num_inference_steps=50).images[0]
```
Memory requirements and local options: Flux does not run under Ollama, which serves text models only (as of 2025). For local image generation with a GUI, use ComfyUI, which supports Flux natively, including fp8-quantized checkpoints that fit in roughly 12 GB of VRAM.
The Flux family offers different quality-speed tradeoffs across its variants; the choice comes down to quality requirements and available compute. Flux.1-schnell is a timestep-distilled model that generates in 4 steps, trading some quality for much lower latency. Flux.1-dev is the guidance-distilled, non-commercial variant for research and development with full-quality generation. Flux.1-pro is the commercial variant with the highest quality, available only through Black Forest Labs' API.
| Variant | Steps | Quality | License | VRAM required |
|---|---|---|---|---|
| Flux.1-schnell | 4 | Good | Apache 2.0 | ~12GB (fp8) |
| Flux.1-dev | 20–50 | High | Non-commercial | ~24GB (bf16) |
| Flux.1-pro | 25+ | Highest | Commercial API | API only |
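The sampling defaults from the table can be packaged in a small helper so call sites don't hard-code per-variant magic numbers (a sketch; the checkpoint IDs are the ones used on the Hugging Face Hub, the values mirror the table above):

```python
# Per-variant sampling defaults for the open-weight Flux checkpoints.
FLUX_DEFAULTS = {
    "black-forest-labs/FLUX.1-schnell": {"num_inference_steps": 4, "guidance_scale": 0.0},
    "black-forest-labs/FLUX.1-dev": {"num_inference_steps": 50, "guidance_scale": 3.5},
}

def sampling_kwargs(model_id: str) -> dict:
    """Return recommended pipe(...) keyword arguments for a Flux checkpoint."""
    return dict(FLUX_DEFAULTS[model_id])  # copy so callers can tweak safely

print(sampling_kwargs("black-forest-labs/FLUX.1-schnell"))
```

Usage is then `image = pipe(prompt, **sampling_kwargs(model_id)).images[0]`, keeping the step count and guidance scale consistent with the variant actually loaded.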
Flux.1's rectified-flow formulation with multimodal diffusion transformer (MM-DiT) blocks produces higher-fidelity generation than earlier diffusion models, particularly for text rendering within images and precise spatial composition. The flow-matching training objective yields straighter sampling trajectories than DDPM-based models, so fewer denoising steps are needed for equivalent quality. These improvements, together with the open weights, explain Flux's emergence as a preferred foundation for fine-tuning and LoRA-based style transfer in the open-source community.
Flux implements rectified flow, a flow-matching formulation that straightens the path from noise to data. Traditional diffusion models follow curved paths from pure noise to images, requiring many denoising steps (20-50+). Rectified flow instead trains the model to predict a velocity, i.e. how to move along the noise-to-image path, rather than the noise added at each step, and the straight-line interpolation used during training makes the learned trajectory easier to integrate in few steps. This change of objective has practical consequences: combined with distillation, it enables high-quality generation in as few as 4 steps (FLUX.1[schnell]), roughly an order of magnitude faster than standard diffusion sampling. For practitioners, this means near-interactive generation on consumer GPUs, unlocking iterative editing workflows that were impractical with older diffusion models.
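The training objective described above fits in a few lines. This toy sketch uses random vectors in place of latent images and omits the network entirely; it only shows what the regression target is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rectified-flow / flow-matching objective on vectors instead of image latents.
x1 = rng.normal(loc=3.0, size=(64, 8))  # "data" samples
x0 = rng.normal(size=(64, 8))           # pure noise samples
t = rng.uniform(size=(64, 1))           # random timesteps in [0, 1]

x_t = (1 - t) * x0 + t * x1             # straight-line interpolation path
v_target = x1 - x0                      # constant velocity along that path

def flow_matching_loss(v_pred, v_target):
    """A model v_theta(x_t, t) is trained to regress the target velocity."""
    return np.mean((v_pred - v_target) ** 2)

# Sampling then integrates dx/dt = v_theta(x, t) from t=0 to t=1,
# which needs few Euler steps when the learned trajectory is nearly straight.
perfect = flow_matching_loss(v_target, v_target)
print(perfect)  # 0.0 for a perfect velocity predictor
```

Contrast with DDPM-style training, where the network regresses the added noise under a curved variance schedule; here the interpolant is linear and the target is a single constant vector per (x0, x1) pair.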
Flux supports several mechanisms for controlling generation. Guidance steers output toward prompt-aligned regions: FLUX.1[schnell] is distilled without guidance (use guidance_scale=0), while FLUX.1[dev] is guidance-distilled, so the guidance strength is fed to the model as an input rather than computed from a second unconditional forward pass as in classic classifier-free guidance (CFG). Beyond text prompts, structural conditioning is available through the FLUX.1 Tools family and ControlNet-style adapters: edge (Canny) and depth maps, inpainting and outpainting (Fill), and image variation (Redux). These pathways combine with a text prompt (e.g., "generate a portrait of a woman" constrained by a depth map), so results respect both semantic intent and structural constraints, improving prompt-to-image fidelity.
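For reference, classic classifier-free guidance, the mechanism that guidance-distilled models like FLUX.1[dev] bake into a single forward pass, is just a linear extrapolation between two predictions:

```python
import numpy as np

def apply_cfg(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Toy 2-component "predictions" standing in for full noise/velocity tensors.
u = np.array([0.0, 0.0])  # unconditional prediction
c = np.array([1.0, 2.0])  # text-conditional prediction

print(apply_cfg(u, c, 1.0))  # scale=1 recovers the plain conditional prediction
print(apply_cfg(u, c, 3.5))  # scale>1 amplifies the prompt-aligned direction
```

In an undistilled model this costs two forward passes per step (conditional and unconditional); guidance distillation trains the network to produce the already-guided output directly, which is why dev accepts a guidance_scale without doubling compute.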
Flux inference can be optimized through several techniques beyond the core algorithm. LoRA (Low-Rank Adaptation) fine-tunes Flux for specific visual styles or concepts with adapters on the order of tens to a few hundred MB instead of the full ~12B-parameter model, enabling fast switching and personalization. Quantization reduces weights from bfloat16 to fp8 or int8 (or 4-bit NF4), cutting memory footprint by 2-4x with modest quality loss. Batching amortizes per-request overhead, prefetching hides I/O latency, and caching intermediate activations across adjacent denoising steps (DeepCache-style) can skip redundant computation. Production deployments often combine these: a quantized base model on NVIDIA GPUs, swappable LoRA adapters for style personalization, and batched inference requests. Together these push latency from tens of seconds per image toward a few seconds, making near-real-time generation practical.
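The memory arithmetic behind quantization is easy to demonstrate with a minimal symmetric int8 scheme (a sketch of the idea only; real deployments use per-channel or block-wise scales via libraries such as bitsandbytes):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32 weights."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # a toy weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print(q.nbytes / w.nbytes)                 # 0.25 -> 4x memory reduction vs float32
print(float(np.abs(w - w_hat).max()) < s)  # True: error bounded by one quantization step
```

For a ~12B-parameter model the same ratio is what turns a ~24 GB bf16 checkpoint into one that fits alongside activations on a consumer GPU.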