Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter in a neural network via the chain rule; it is the foundation of all gradient-based learning.
A neural network is a function f(x; W) parameterized by weights W. Training means finding W that minimizes a loss L. Gradient descent requires knowing ∂L/∂W for every parameter. For a network with 7 billion parameters, computing this efficiently is the problem backpropagation solves.
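A minimal sketch of this loop in PyTorch (sizes and learning rate are arbitrary, chosen for illustration): one backward pass fills `p.grad` with ∂L/∂p for every parameter, after which a plain gradient-descent update is a single loop.

```python
import torch

# Toy network; layer sizes are illustrative, not from the text.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x, target = torch.randn(4, 10), torch.randn(4, 1)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()  # backprop: computes p.grad for ALL parameters at once

with torch.no_grad():
    for p in model.parameters():
        p -= 0.01 * p.grad  # one gradient-descent step
model.zero_grad()
```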
Why not finite differences? The central-difference estimate (f(w+ε) - f(w-ε)) / 2ε requires two forward passes per parameter. For 7B parameters, that's 14B forward passes per update, completely infeasible. Backprop computes all gradients in a single backward pass.
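The contrast can be checked directly: one `backward()` yields every component of the gradient, while the central difference below spends two extra function evaluations to recover just one component (double precision is used so the two agree tightly).

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, dtype=torch.float64, requires_grad=True)
x = torch.randn(5, dtype=torch.float64)

def loss_fn(w):
    return (w * x).sum() ** 2

loss = loss_fn(w)
loss.backward()  # one pass: gradients for ALL 5 elements of w

# Central finite difference for a SINGLE element: two extra evaluations.
eps, i = 1e-4, 2
w_plus = w.detach().clone();  w_plus[i] += eps
w_minus = w.detach().clone(); w_minus[i] -= eps
fd = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)

assert torch.allclose(w.grad[i], fd)
```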
Deep networks multiply many layer Jacobians together during backprop. If the Jacobians' norms are consistently below 1, the product shrinks exponentially with depth and gradients vanish; if consistently above 1, the product grows exponentially and gradients explode.
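A deep linear chain makes this visible: each layer's Jacobian is just its weight matrix, so scaling the weights slightly below or above unit gain drives the input gradient to zero or to astronomically large values (depth and width here are arbitrary).

```python
import torch

torch.manual_seed(0)
depth, width = 50, 64

def grad_norm_at_input(scale):
    # Each matmul multiplies the backward signal by roughly `scale`.
    x = torch.randn(width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) * scale / width**0.5
        h = W @ h
    h.sum().backward()
    return x.grad.norm().item()

small = grad_norm_at_input(0.5)  # ~0.5^50: vanished
large = grad_norm_at_input(2.0)  # ~2^50: exploded
print(small, large)
```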
Backprop normally stores every intermediate activation from the forward pass. For a 100-layer network, that is roughly 100× the activation memory of a single layer. Gradient checkpointing trades memory for compute.
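In PyTorch this is a one-line change via `torch.utils.checkpoint.checkpoint_sequential`: split the network into segments, store only the segment-boundary activations, and recompute everything inside a segment during backward. The network below is a hypothetical 100-layer MLP.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical 100-layer network; sizes chosen for illustration.
layers = [torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
          for _ in range(100)]
model = torch.nn.Sequential(*layers)
x = torch.randn(8, 64, requires_grad=True)

# 4 segments: only segment-boundary activations are kept; the rest
# are recomputed during the backward pass (extra compute, less memory).
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```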
Backpropagation through transformer attention is more memory-intensive than through simple feed-forward networks: the backward pass needs the attention score matrices (sequence_length × sequence_length) for every layer, so they must be kept in GPU memory throughout the forward pass. For long sequences, the memory cost of stored activations can exceed the model parameter memory: a 1024-token sequence through 32 attention layers requires storing 32 score matrices of 1024×1024 floats per attention head and batch element. Gradient checkpointing mitigates this by recomputing activations during the backward pass instead of storing them, trading compute for memory.
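The arithmetic behind that figure, as a quick sanity check (assuming fp32 scores and a single head; real runs multiply by head count and batch size):

```python
# Back-of-envelope memory for stored attention score matrices.
seq_len, n_layers, bytes_per_float = 1024, 32, 4

per_layer = seq_len * seq_len * bytes_per_float  # one score matrix
total = per_layer * n_layers                     # all 32 layers
print(per_layer // 2**20, "MiB per layer")       # 4 MiB
print(total // 2**20, "MiB across layers")       # 128 MiB
```

Multiplying by, say, 32 heads and a batch of 8 already puts the score matrices alone in the tens of gigabytes, which is why long-sequence training leans on checkpointing or fused attention kernels.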
The Adam optimizer stores two additional states per parameter: a running mean of gradients (first moment) and a running mean of squared gradients (second moment). A 7B-parameter model in float16 requires 14GB for the parameters alone, but the two Adam states are typically kept in float32 for numerical stability, adding roughly 56GB and pushing Adam-based full fine-tuning of a 7B model to 70GB+ VRAM before gradients and activations are counted. AdamW-8bit from bitsandbytes reduces optimizer memory by 4x through 8-bit quantization of the optimizer states while maintaining comparable convergence quality.
| Training component | Memory (7B model) | Reduction technique |
|---|---|---|
| Model parameters (fp16) | ~14GB | Quantize to 4-bit (QLoRA) |
| Adam optimizer states (fp32) | ~56GB | 8-bit Adam, Adafactor |
| Gradients (fp16) | ~14GB | Gradient accumulation |
| Activations | Variable | Gradient checkpointing |
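The fixed-size rows of the table follow from per-parameter byte counts (decimal GB here, matching the text's figures):

```python
# Reconstructing the table's estimates for a 7B-parameter model.
n_params = 7e9
GB = 1e9

params_fp16 = n_params * 2 / GB      # 2 bytes/param       -> 14 GB
adam_fp32   = n_params * 2 * 4 / GB  # two fp32 moments    -> 56 GB
grads_fp16  = n_params * 2 / GB      # one fp16 grad/param -> 14 GB

print(params_fp16, adam_fp32, grads_fp16)
```

Activation memory has no such closed form: it scales with batch size, sequence length, and architecture, hence "Variable" in the table.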
Backpropagation operates on a computational graph constructed during the forward pass: each tensor stores references to its parents and to the operation that produced it. The graph is a directed acyclic graph (DAG) in which nodes are tensors and edges are the operations connecting them (matmul, addition, activation). During backpropagation, PyTorch traverses this DAG in reverse topological order, computing gradients. The memory footprint depends on what is retained: storing activations for every layer enables fast backward computation but consumes O(depth) memory. Gradient checkpointing trades memory for computation: discard most activations during the forward pass, then recompute them during the backward pass at the cost of extra FLOPs. For models more than 100 layers deep, this can reduce activation memory by 50%+ at roughly 30% compute overhead, which is often what makes large models fit on limited hardware. Production implementations balance these carefully: checkpointing expensive layers (transformer self-attention), retaining activations for cheap layers (linear projections), and profiling the memory/compute trade-off empirically for each architecture.
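The graph is inspectable: every non-leaf tensor carries a `grad_fn` node whose `next_functions` point to its parents, which is exactly the structure `backward()` walks in reverse. A small probe (following only the first parent at each step, for brevity):

```python
import torch

x = torch.randn(3, requires_grad=True)
W = torch.randn(3, 3, requires_grad=True)
y = torch.relu(W @ x).sum()

# Walk the autograd DAG from the output back toward the leaves.
names = []
node = y.grad_fn
while node is not None:
    names.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None
print(names)  # ends at an AccumulateGrad leaf node
```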
Backpropagation through many layers can become numerically unstable: gradients explode (destabilizing training) or vanish (near-zero gradients after many layers prevent weight updates). Modern techniques address this: layer normalization stabilizes activations before they enter the next layer, preventing activation ranges from diverging; residual connections give gradients a path that skips layers, mitigating vanishing gradients; and mixed-precision training uses float16 for the forward/backward passes but float32 for gradient accumulation, reducing memory while preserving numerical precision. Understanding where numerical issues occur requires gradient monitoring: tracking gradient magnitude and variance across layers and comparing them to activation magnitudes. Libraries provide debugging tools: TensorFlow's `tf.debugging.check_numerics()` halts on NaN/Inf, and PyTorch's anomaly detection (`torch.autograd.set_detect_anomaly(True)`) pinpoints the operation producing non-finite gradients. Production training pipelines often implement monitoring: logging gradient statistics to TensorBoard, alerting when gradient norms diverge beyond expected ranges, and enabling easy rollback when training becomes numerically unstable.
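A minimal sketch of both tools together: per-parameter tensor hooks capture gradient norms as they are computed (the dict would feed a TensorBoard logger in practice), and the `detect_anomaly` context manager flags the op that produces a non-finite gradient. The network here is a throwaway example.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))

grad_norms = {}

def make_hook(name):
    def hook(grad):
        # Fires when this parameter's gradient is computed during backward.
        grad_norms[name] = grad.norm().item()
    return hook

for name, p in model.named_parameters():
    p.register_hook(make_hook(name))

# Anomaly detection is slow: enable only when debugging NaN/Inf gradients.
with torch.autograd.detect_anomaly():
    model(torch.randn(4, 8)).sum().backward()

print(grad_norms)  # per-parameter gradient norms, ready for logging
```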
PyTorch allows defining custom backward functions via `torch.autograd.Function`, enabling scenarios where standard gradient computation is insufficient. Common uses: straight-through estimators (STE) for discrete operations like quantization or sampling, where true gradients are zero or undefined but estimates enable learning; reweighting gradients for imbalanced classification (higher weight on rare classes); and approximating expensive-to-compute gradients via cheaper surrogates. Custom backward demands rigor: analytic gradients should be checked against finite differences (PyTorch's `torch.autograd.gradcheck` automates this, though intentionally biased estimators like STE will not pass it) and tested independently of the broader model. For research and specialized applications this flexibility is powerful: novel loss functions, modified attention mechanisms, and experimental gradient estimators all become feasible. However, a mistaken backward implementation can silently produce wrong gradients that trigger no errors but subtly degrade convergence, so rigorous testing of custom backward functions is non-negotiable.
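The canonical example is a straight-through estimator for rounding: the forward pass quantizes, while the backward pass pretends the op was the identity so gradients keep flowing.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in forward; straight-through (identity) gradient in backward."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # round() has zero gradient almost everywhere; the STE
        # deliberately passes the upstream gradient through unchanged.
        return grad_output

x = torch.tensor([0.2, 0.7, 1.4], requires_grad=True)
y = RoundSTE.apply(x)
y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.]) despite the non-differentiable round
```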