Training Tech

DeepSpeed

Microsoft's distributed training library with ZeRO optimizer stages — enables training 100B+ models across multiple GPUs by sharding optimizer states, gradients, and parameters.

ZeRO-3
Full Sharding
10×+
Larger Models
FSDP
Alternative


SECTION 01

Why DeepSpeed?

Training a 70B model requires ~140GB for weights alone (bfloat16). Optimizer states (Adam) add another 280GB. With gradients, you need ~560GB+ — impossible on a single GPU. DeepSpeed's ZeRO (Zero Redundancy Optimizer) shards all of this across GPUs.

ZeRO Stage       What's Sharded                 Memory / GPU (70B, 8 GPUs)
Stage 0 (none)   Nothing                        560 GB
Stage 1          Optimizer states               ~320 GB
Stage 2          Optimizer states + gradients   ~200 GB
Stage 3          Everything (params too)        ~70 GB
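The table's arithmetic can be reproduced with a small sketch (hypothetical helper, not a DeepSpeed API; assumes bf16 weights and gradients at 2 bytes/param and the ~4 bytes/param optimizer-state accounting behind the 140 GB / 280 GB figures above, with Stage 3 dividing all model states evenly across GPUs):

```python
# Rough per-GPU memory for a 70B model under each ZeRO stage.
PARAMS = 70e9
WEIGHTS = 2 * PARAMS / 1e9   # GB, bf16 weights
GRADS = 2 * PARAMS / 1e9     # GB, bf16 gradients
OPTIM = 4 * PARAMS / 1e9     # GB, optimizer states (simplified accounting)

def per_gpu_gb(stage: int, n_gpus: int) -> float:
    """Per-GPU memory (GB) for model states under a given ZeRO stage."""
    if stage == 0:   # plain DDP: everything replicated on every GPU
        return WEIGHTS + GRADS + OPTIM
    if stage == 1:   # shard optimizer states only
        return WEIGHTS + GRADS + OPTIM / n_gpus
    if stage == 2:   # also shard gradients
        return WEIGHTS + GRADS / n_gpus + OPTIM / n_gpus
    if stage == 3:   # shard everything, including parameters
        return (WEIGHTS + GRADS + OPTIM) / n_gpus
    raise ValueError(stage)

for s in range(4):
    print(f"Stage {s}: {per_gpu_gb(s, n_gpus=8):.0f} GB/GPU")
```

Note how only Stage 3 divides the dominant weight/gradient terms, which is why it is the only stage that fits a 70B model on 80 GB cards without offload.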
SECTION 02

ZeRO Optimizer Stages

# ZeRO stage comparison for a 7B model across 8 × 80GB A100s:
#
# Stage 0 (DDP): 8 replicas × 56GB = 448GB total
#   Each GPU holds the full model + optimizer states. Standard DDP.
#
# Stage 1: shard optimizer states across GPUs
#   Optimizer states: 28GB → 28 / 8 = 3.5GB per GPU
#   Still holds the full model (28GB) + gradients (28GB) per GPU
#   Per-GPU: ~60GB
#
# Stage 2: Stage 1 + shard gradients
#   Gradients: 28 / 8 = 3.5GB per GPU
#   Per-GPU: ~35GB
#
# Stage 3: shard everything (params, grads, optimizer states)
#   Per-GPU: ~7GB for a 7B model on 8 GPUs!
#   Tradeoff: communication overhead (params are gathered at each layer)
#
# Key insight: with 8 × 80GB A100s, even Stage 2 lets you train 65B+ models
SECTION 03

Config File

# ds_config.json — DeepSpeed configuration
{
  "zero_optimization": {
    "stage": 2,                      // ZeRO Stage 2
    "overlap_comm": true,            // Overlap communication with compute
    "contiguous_gradients": true,    // Reduce memory fragmentation
    "reduce_bucket_size": 5e8,       // Communication bucket size
    "allgather_bucket_size": 5e8
  },
  "bf16": {
    "enabled": true                  // bfloat16 training
  },
  "gradient_clipping": 1.0,          // Max gradient norm
  "train_batch_size": "auto",        // Set automatically from args
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "weight_decay": "auto",
      "torch_adam": true
    }
  }
}
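Strict JSON does not allow `//` comments, so a real ds_config.json must omit them. One way to sidestep the pitfall is to build the config as a Python dict and dump it (a sketch; the "auto" placeholders are resolved by the HuggingFace Trainer at launch):

```python
import json

# Same ZeRO Stage 2 config as a Python dict, then written as valid JSON.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": [0.9, 0.999],
                   "weight_decay": "auto", "torch_adam": True},
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```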
SECTION 04

HuggingFace + DeepSpeed

from transformers import TrainingArguments, Trainer

# DeepSpeed is a first-class option in the HuggingFace Trainer
args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",      # Point to your config file
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

# Launch with torchrun (or the deepspeed launcher):
#   torchrun --nproc_per_node=8 train.py
#   deepspeed --num_gpus=8 train.py

# For Stage 3, add to the config:
#   "zero_optimization": {
#       "stage": 3,
#       "offload_optimizer": {"device": "cpu"},  # offload optimizer states to CPU
#       "offload_param": {"device": "cpu"}       # offload parameters to CPU
#   }
# ZeRO-3 + CPU offload: fine-tune 30B models on 4 × 48GB A6000s
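DeepSpeed derives and validates the global batch size from three knobs: per-GPU micro-batch, gradient-accumulation steps, and world size. A quick sanity check for the arguments above (`effective_batch_size` is a hypothetical helper for illustration, not a DeepSpeed API):

```python
def effective_batch_size(micro_batch: int, grad_accum: int, world_size: int) -> int:
    """Global batch = per-GPU micro-batch x accumulation steps x number of GPUs."""
    return micro_batch * grad_accum * world_size

# The TrainingArguments above, launched with torchrun --nproc_per_node=8:
# 2 per GPU x 8 accumulation steps x 8 GPUs
assert effective_batch_size(2, 8, 8) == 128
```

If a config hard-codes `train_batch_size` instead of "auto", DeepSpeed raises an error when the three knobs do not multiply out to it.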
SECTION 05

ZeRO-3 Inference

from deepspeed import init_inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load a model that doesn't fit on one GPU for inference
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# DeepSpeed Inference: shard the model across GPUs
ds_model = init_inference(
    model,
    mp_size=4,                        # Tensor parallel across 4 GPUs
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,  # Use DeepSpeed's optimized kernels
)

# Inference (same API as standard HuggingFace)
inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = ds_model.generate(**inputs, max_length=100)

# ZeRO-Inference vs tensor parallelism:
#   ZeRO-Inference: shards by parameter → less communication overhead
#   Tensor parallel: shards by matrix dimension → better GPU utilization
#   Use ZeRO-Inference for large single-request scenarios
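The "shards by parameter" idea can be sketched in plain Python (a toy simulation with hypothetical helper names, not DeepSpeed's partitioner): each rank keeps one contiguous slice of a layer's flattened weights, and an all-gather reconstructs the full tensor just in time for that layer's forward pass.

```python
def shard(params: list[float], n_ranks: int) -> list[list[float]]:
    """Split a flat parameter list into one contiguous shard per rank."""
    per_rank = (len(params) + n_ranks - 1) // n_ranks
    return [params[i * per_rank:(i + 1) * per_rank] for i in range(n_ranks)]

def all_gather(shards: list[list[float]]) -> list[float]:
    """Every rank concatenates all shards to recover the full parameter."""
    return [p for s in shards for p in s]

weights = [float(i) for i in range(10)]
shards = shard(weights, n_ranks=4)
assert max(len(s) for s in shards) == 3   # each rank stores <= ceil(10/4) values
assert all_gather(shards) == weights      # the forward pass sees full weights
```

In the real system this gather/free cycle repeats per layer, which is exactly the communication overhead the "tradeoff" notes above refer to.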
SECTION 06

Alternatives: FSDP

import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# PyTorch's built-in FSDP — equivalent to ZeRO-3
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)

# HuggingFace Trainer supports FSDP natively:
#   TrainingArguments(
#       fsdp="full_shard auto_wrap",
#       fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"},
#   )

# DeepSpeed vs FSDP:
#   DeepSpeed: more features (CPU offload, ZeRO-Infinity), harder to debug
#   FSDP: PyTorch-native, simpler, better torch.compile compatibility
#   2024 recommendation: FSDP for new projects, DeepSpeed if you need CPU offload
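Full sharding (ZeRO-3 / FSDP) pays for its memory savings in communication. A back-of-the-envelope sketch using the standard accounting from the ZeRO paper (per-step, per-GPU volumes in units of parameter count; approximate):

```python
def comm_volume_per_step(n_params: float, scheme: str) -> float:
    """Approximate per-GPU communication volume per training step.

    ddp:   one gradient all-reduce, ~2 x P (ring all-reduce moves ~2P per worker)
    zero3: all-gather params in forward (P) + in backward (P)
           + reduce-scatter gradients (P), ~3 x P
    """
    return {"ddp": 2 * n_params, "zero3": 3 * n_params}[scheme]

# Full sharding costs ~1.5x the communication of plain DDP per step
ratio = comm_volume_per_step(7e9, "zero3") / comm_volume_per_step(7e9, "ddp")
assert ratio == 1.5
```

That fixed ~1.5x factor is why full sharding is usually fine over NVLink/InfiniBand but painful over slow Ethernet.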
Starting point: For 1-2 GPUs: ZeRO Stage 2 via HuggingFace (just set deepspeed=...). For 4+ GPUs training large models: ZeRO Stage 3 or FSDP. For 65B+ models on limited hardware: ZeRO-3 + CPU offload.

Distributed Training with DeepSpeed

DeepSpeed enables large-scale distributed training through aggressive memory optimization. Rather than replicating the full training state on every GPU, ZeRO shards optimizer states, gradients, and (at Stage 3) the parameters themselves across devices, and can optionally offload these states to CPU memory. This frees GPU memory for larger models and larger effective batch sizes without out-of-memory errors.

# DeepSpeed distributed training setup
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Initialize with ZeRO-3 for full sharding
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config={
        "train_batch_size": 256,
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}}
    }
)

Multi-GPU Communication Patterns

Efficient distributed training requires understanding ring-based all-reduce collectives and gradient accumulation. DeepSpeed's communication library optimizes these patterns across NVIDIA's NCCL and AMD's RCCL, reducing network bottlenecks in large clusters.
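The ring all-reduce pattern can be sketched in a single process (a toy model with hypothetical helper names, not NCCL): every worker ends up with the element-wise sum of all workers' gradients, and each worker transmits roughly 2(N-1)/N times its data, which is what makes the ring near bandwidth-optimal.

```python
def ring_all_reduce(grads_per_worker: list[list[float]]) -> list[list[float]]:
    """Model the *result* of ring all-reduce: every worker gets the sum."""
    summed = [sum(vals) for vals in zip(*grads_per_worker)]
    return [summed[:] for _ in grads_per_worker]

def ring_bytes_sent_per_worker(n_workers: int, n_bytes: int) -> float:
    """Ring all-reduce sends 2 * (N-1)/N * data per worker, near-optimal."""
    return 2 * (n_workers - 1) / n_workers * n_bytes

workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = ring_all_reduce(workers)
assert result == [[9.0, 12.0]] * 3   # all three workers hold the summed gradient
```

The real implementation pipelines a reduce-scatter phase and an all-gather phase around the ring, but the end state and the bandwidth bound are as modeled here.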

Choosing a ZeRO stage is a trade-off between memory savings and communication cost. ZeRO-1 shards only the optimizer states, adding minimal communication overhead, so it works even on modest interconnects. ZeRO-2 additionally shards gradients, cutting per-GPU memory substantially at the cost of somewhat more communication. ZeRO-3 shards the parameters themselves, which enables training models larger than a single GPU's memory, but its per-layer all-gathers raise communication volume by roughly 1.5x over plain data parallelism and therefore benefit from high-bandwidth interconnects; over slow links (say, 10 Gbps Ethernet), ZeRO-1 with smaller models is the safer choice. DeepSpeed's activation checkpointing is complementary: it recomputes activations during the backward pass instead of storing them, trading extra GPU compute for a large reduction in peak activation memory. For production systems with throughput targets, ZeRO-2 often strikes the best balance of training speed, memory efficiency, and communication overhead.
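Activation checkpointing can be sketched in plain Python (a toy chain of layers with hypothetical helper names, not DeepSpeed's API): keep only every k-th activation during the forward pass, and rebuild the missing ones from the nearest checkpoint when the backward pass needs them.

```python
def forward_with_checkpoints(x: float, layers, k: int):
    """Run the layer chain forward, saving only every k-th activation."""
    saved = {0: x}                        # checkpointed activations by index
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        if i % k == 0:
            saved[i] = x
    return x, saved

def recompute(saved, layers, target: int) -> float:
    """Rebuild the activation at layer `target` from the nearest checkpoint."""
    start = max(i for i in saved if i <= target)
    x = saved[start]
    for layer in layers[start:target]:    # redo the forward work in between
        x = layer(x)
    return x

layers = [lambda v, a=a: v + a for a in range(1, 9)]   # 8 toy "layers"
out, saved = forward_with_checkpoints(0.0, layers, k=4)
assert out == 36.0                  # 1 + 2 + ... + 8
assert sorted(saved) == [0, 4, 8]   # only 3 of 9 activations kept in memory
assert recompute(saved, layers, 6) == 21.0   # rebuilt, never stored
```

Peak activation memory drops from O(layers) to O(layers / k) saved tensors, at the price of re-running up to k-1 layers per recomputation.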

In rough terms, the stages trade throughput for capacity. Stage 1 shards the Adam moments across GPUs, roughly halving per-GPU model-state memory with little extra communication. Stage 2 also shards gradients, pushing memory down further with moderate additional communication that local networks handle well. Stage 3 divides everything by the GPU count, but its synchronized all-gather operations work best on high-bandwidth interconnects (NVLink, InfiniBand); throughput drops as you climb the stages, while the maximum model and batch sizes grow. A common rule of thumb by model size: below ~7B parameters, plain data parallelism usually suffices; at 7-13B, use ZeRO-1 or 2; at 13B and above, use ZeRO-2 or 3. DeepSpeed hides most of the complexity: the JSON config specifies the stage, offload targets, and batch-size budget, and the framework schedules communication and computation automatically. Integration with the HuggingFace Trainer is a one-line change (deepspeed="ds_config.json") in an existing training pipeline.
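The model-size rule of thumb above can be encoded as a tiny helper (a hypothetical function, not part of DeepSpeed; assumes 80GB-class GPUs, and the boundary cases could equally go one stage up or down):

```python
def suggest_zero_stage(n_params_billions: float) -> int:
    """Rule-of-thumb ZeRO stage for a given model size in billions of params."""
    if n_params_billions < 7:
        return 0      # plain data parallelism is usually enough
    if n_params_billions <= 13:
        return 1      # or 2: shard optimizer states (and maybe gradients)
    return 3          # or 2: shard everything for the largest models

assert suggest_zero_stage(1.3) == 0
assert suggest_zero_stage(13) == 1
assert suggest_zero_stage(70) == 3
```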

Production deployment of models trained with DeepSpeed calls for different optimizations than training. Training-time techniques (ZeRO-3 sharding, activation checkpointing) save memory for gradients and optimizer states that simply do not exist at inference time, where each forward pass handles a single sample or a small batch. Inference-side options include distillation (train a smaller student model to match the large one) and quantization (reduce precision, for example to int8, typically with only a small accuracy loss). Models too large for one GPU (Llama 70B, Mixtral) rely on tensor parallelism to spread a single forward pass across GPUs. The DeepSpeed-Inference module supplies fused, optimized kernels for common operations, yielding meaningful speedups on standard inference hardware. On the cost side, training is a one-time expense while serving cost scales with request volume, so beyond some number of requests the serving bill dominates, which favors distilled or quantized variants for deployment.
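Symmetric int8 post-training quantization, one of the serving options mentioned above, can be sketched in a few lines (a toy per-tensor scheme with hypothetical helper names, not a production quantizer):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to int8 range [-127, 127] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # guard all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.8]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
assert all(-127 <= v <= 127 for v in q)
# Round-trip error is bounded by half the quantization step:
assert max(abs(a - b) for a, b in zip(w, restored)) <= scale / 2 + 1e-9
```

Real int8 schemes add per-channel scales and calibration over activation statistics, but the storage win is the same: 1 byte per weight instead of 4.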

Operator fusion combines multiple operations into a single GPU kernel, reducing kernel-launch overhead and memory round-trips: fusing matrix multiplication + bias addition + activation into one kernel avoids writing intermediate results to HBM and reading them back. Fused patterns such as softmax + dropout or layer normalization + bias typically beat chains of standard PyTorch kernels. Writing such kernels demands CUDA expertise, which is why they mostly come from frameworks with dedicated optimization teams. DeepSpeed ships pre-implemented fused kernels for attention (multi-head self-attention as a single fused kernel), GeLU activation, and layer normalization, covering a large share of a typical transformer's compute. The gains compound multiplicatively: for example, 1.5x from ZeRO-enabled scaling, 1.3x from fused kernels, and 1.2x from other optimizations yields about a 2.3x total speedup, turning a months-long large-model training run into a substantially shorter one and enabling more experimentation.
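What fusion buys can be illustrated without CUDA (a pure-Python toy: each list comprehension stands in for one kernel launch and one pass over memory, and the fused version produces bit-identical results with half the passes):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def bias_gelu_unfused(xs: list[float], bias: float) -> list[float]:
    tmp = [x + bias for x in xs]        # "kernel" 1: bias add, writes a temporary
    return [gelu(t) for t in tmp]       # "kernel" 2: activation, reads it back

def bias_gelu_fused(xs: list[float], bias: float) -> list[float]:
    return [gelu(x + bias) for x in xs]  # one pass, no intermediate buffer

xs = [-1.0, 0.0, 2.0]
assert bias_gelu_fused(xs, 0.1) == bias_gelu_unfused(xs, 0.1)
```

On a GPU the intermediate buffer would be a full activation tensor in HBM, so eliminating it saves both the round-trip bandwidth and a kernel launch.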

Multi-node training with DeepSpeed across 100+ nodes lives or dies on the network. High-bandwidth interconnects matter enormously: NVLink reaches up to 900 GB/s per GPU on H100-class hardware, versus about 12.5 GB/s for 100 Gbps Ethernet, a gap of roughly 70x. Topology matters too: a fat-tree with redundant links avoids bottlenecks, while cheaper direct-attach layouts are less robust. The all-reduce that sums gradients across devices comes in variants: ring all-reduce minimizes per-worker bandwidth, tree all-reduce minimizes latency with logarithmic depth. DeepSpeed also offers gradient compression to shrink communication volume: quantizing gradients to lower precision before sending (as in its 1-bit Adam optimizer), and top-K sparsification, which transmits only the largest-magnitude gradient components while accumulating the rest locally as an error residual for later steps. Compression adds compute overhead, but communication dominates for large models, so it usually pays off. In practice a nominal 100 Gbps network can deliver far less effective bandwidth because of protocol and synchronization overhead, so tuning gradient accumulation, activation checkpointing, and communication bucketing all matter.
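Top-K sparsification with local error accumulation, as described above, in sketch form (hypothetical helper names, not DeepSpeed's compressor API): each step adds the carried-over residual, transmits only the k largest-magnitude entries, and keeps everything else locally for the next step.

```python
def topk_compress(grad: list[float], residual: list[float], k: int):
    """Send the k largest-magnitude entries; accumulate the rest locally."""
    total = [g + r for g, r in zip(grad, residual)]       # add error feedback
    idx = sorted(range(len(total)), key=lambda i: abs(total[i]), reverse=True)[:k]
    sent = {i: total[i] for i in idx}                     # sparse message
    new_residual = [0.0 if i in sent else total[i] for i in range(len(total))]
    return sent, new_residual

grad = [0.1, -2.0, 0.05, 1.5]
sent, residual = topk_compress(grad, [0.0] * 4, k=2)
assert set(sent) == {1, 3}                    # the two largest magnitudes go out
assert residual == [0.1, 0.0, 0.05, 0.0]      # small entries wait locally
```

Because the residual is folded into the next step, small gradients are delayed rather than lost, which is what keeps convergence close to the uncompressed baseline.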