Regularization

Dropout

Randomly zeros activations during training to prevent co-adaptation of neurons — less critical in modern LLMs but important for smaller models and fine-tuning.

Attention dropout (typical): p=0.1
Large LLMs: p=0
Mechanism: implicit ensemble


SECTION 01

How Dropout Works

During training, randomly set each activation to zero with probability p. Scale the remaining activations by 1/(1-p) to maintain expected magnitude. This prevents individual neurons from memorizing specific patterns — forcing the network to learn redundant representations.

```python
import torch
import torch.nn as nn

# Manual dropout (training mode)
x = torch.randn(4, 8)
p = 0.3  # Drop 30% of neurons

# Random mask: True where we KEEP, False where we drop
mask = torch.rand_like(x) > p  # Keep if random > p
x_dropped = x * mask / (1 - p)  # Apply mask + scale to maintain expected value

# PyTorch Dropout layer (handles train/eval modes automatically)
dropout = nn.Dropout(p=0.1)
x = torch.randn(32, 512, 768)  # (batch, seq, d_model)
print(f"Zeros before: {(x == 0).sum().item()}")
out = dropout(x)  # ~10% of values set to 0
print(f"Zeros after: {(out == 0).sum().item()}")  # ~32*512*768*0.1 zeros
```
SECTION 02

Training vs Inference

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.Dropout(0.1), nn.Linear(256, 10))
x = torch.randn(4, 256)

# Training mode: dropout is active
model.train()
out1 = model(x)
out2 = model(x)
print((out1 == out2).all())  # False: different dropout masks!

# Eval mode: dropout is disabled (identity function)
model.eval()
out1 = model(x)
out2 = model(x)
print((out1 == out2).all())  # True: deterministic

# Common mistake: forgetting to call model.eval() at inference
# -> dropout still active -> inconsistent outputs + lower effective capacity

# ALWAYS call model.eval() before inference/evaluation
model.eval()
test_input = torch.randn(4, 256)
with torch.no_grad():
    predictions = model(test_input)
```
SECTION 03

Dropout in Transformers

```python
from transformers import AutoModelForCausalLM, AutoConfig

# Transformer dropout settings (note: Llama checkpoints are gated, so loading
# the config requires accepting the license on the Hugging Face Hub)
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
print(config.attention_dropout)  # 0.0 in Llama (disabled)
# Llama has no hidden/residual dropout field at all: it is simply omitted

# Why do modern LLMs use dropout=0?
# - Large models trained on huge data: the data regularizes enough
# - Dropout slows training without benefit when data >> model capacity
# - Preferred regularizers: weight decay, gradient clipping

# Smaller models or fine-tuning: still useful
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_dropout_prob)  # 0.1: BERT uses dropout
print(config.attention_probs_dropout_prob)  # 0.1

# For fine-tuning: sometimes helpful to re-enable dropout.
# Pass the overrides at load time so the dropout modules are built with them
# (mutating model.config after loading does NOT update already-built layers)
model = AutoModelForCausalLM.from_pretrained("gpt2", resid_pdrop=0.1, attn_pdrop=0.1)
```
SECTION 04

When to Use It

| Scenario | Dropout rate | Rationale |
|---|---|---|
| LLM pre-training (large) | 0.0 | Data regularizes; dropout slows training |
| Small transformers (BERT-base) | 0.1 | Prevents overfitting on smaller datasets |
| Fine-tuning on small dataset | 0.05–0.1 | Prevents overfitting to a few thousand examples |
| Classification head | 0.1–0.3 | New head benefits from regularization |
| MC Dropout (uncertainty) | 0.1–0.2 | Keep dropout at inference for Bayesian approximation |
SECTION 05

Variants

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard Dropout: zero individual neurons
dropout = nn.Dropout(p=0.1)

# Dropout2d: zero entire channels (for CNNs)
dropout2d = nn.Dropout2d(p=0.1)  # Input (N, C, H, W): zeros whole C channels

# AlphaDropout: maintains mean and std (for SELU activations)
alpha_dropout = nn.AlphaDropout(p=0.1)

# Attention dropout: drop attention weights (used in BERT, T5)
# attention_probs = F.dropout(attention_probs, p=0.1, training=self.training)

# MC Dropout: keep dropout active at inference for uncertainty
class BayesianModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(256, 256), nn.Dropout(0.2))

    def forward(self, x):
        return self.layers(x)

    def predict_with_uncertainty(self, x, n_samples=20):
        self.train()  # Keep dropout active!
        with torch.no_grad():
            preds = torch.stack([self(x) for _ in range(n_samples)])
        return preds.mean(0), preds.std(0)
```
SECTION 06

Practical Guide

```python
import torch.nn as nn

# Fine-tuning recipe with dropout
class FineTunedClassifier(nn.Module):
    def __init__(self, backbone, n_classes):
        super().__init__()
        self.backbone = backbone
        # Classification head with dropout
        self.head = nn.Sequential(
            nn.Dropout(0.1),  # Regularize the head
            nn.Linear(768, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, **inputs):
        outputs = self.backbone(**inputs)
        cls_token = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        return self.head(cls_token)

# If overfitting (val loss rises while train loss keeps falling):
# 1. Increase dropout: 0.1 -> 0.2 -> 0.3
# 2. Add weight decay (AdamW weight_decay=0.1)
# 3. Reduce model size or training epochs
# 4. Get more data (most effective)
```
Monitoring: if validation loss is still decreasing (no overfitting), added dropout may just be slowing learning. If validation loss plateaus or rises while training loss keeps falling, increase dropout or weight decay.
SECTION 07

Dropout Schedules and Adaptive Dropout

Standard dropout applies a fixed probability p throughout training. Adaptive dropout varies p during training or across layers, often starting high (0.5) and decaying toward zero. Some architectures use layer-dependent dropout where deeper layers have higher dropout rates to prevent the representational collapse that can occur in very deep networks.
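One way to wire up such a decaying schedule in PyTorch is sketched below; the `set_dropout` helper and the linear decay from 0.5 to 0 are illustrative choices, not a standard API.

```python
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    # Update the rate of every nn.Dropout module in place
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

model = nn.Sequential(nn.Linear(64, 64), nn.Dropout(0.5), nn.Linear(64, 10))

total_steps = 1000
for step in range(total_steps):
    # Linear decay from 0.5 toward 0.0 over the course of training
    set_dropout(model, 0.5 * (1 - step / total_steps))
    # ... forward/backward/optimizer step would go here ...
```

A layer-dependent variant would instead assign each `nn.Dropout` its own rate at construction time.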

SECTION 08

Dropout in Production Deployment

A critical gotcha: always call model.eval() before inference to disable dropout. Leaving dropout enabled during inference adds unwanted stochasticity to predictions, degrading reproducibility and performance. For uncertainty quantification, use dedicated Bayesian techniques or Monte Carlo dropout with explicit forward passes, not accidental training-mode inference.

Dropout implementation details: During training, dropout randomly sets activations to zero with probability p, then scales remaining activations by 1/(1-p) to maintain expected value. This scaling (called inverted dropout) is crucial: without it, disabling dropout at inference would effectively change the model's learned feature magnitudes. PyTorch's nn.Dropout implements inverted dropout automatically, making the training→inference transition seamless as long as eval() mode is used correctly.
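The scaling argument is easy to check numerically; the tensor of ones below is synthetic and stands in for any activation.

```python
import torch

torch.manual_seed(0)
p = 0.3
x = torch.ones(1_000_000)

# Inverted dropout: zero with probability p, scale survivors by 1/(1-p)
mask = (torch.rand_like(x) > p).float()
x_dropped = x * mask / (1 - p)

print(x.mean().item())          # 1.0
print(x_dropped.mean().item())  # ~1.0: expected magnitude is preserved
```

Without the `1/(1-p)` factor, the dropped tensor's mean would shrink to roughly `1-p`, and eval mode (which applies no mask) would see feature magnitudes the network never trained on.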

Dropout interacts with batch normalization in non-obvious ways. Placing dropout before batch norm can interfere with the norm's centering and scaling. Placing it after allows batch norm to stabilize the training signal before dropout's stochasticity. Modern best practices suggest dropout after the activation function but sometimes before the next layer's input. The interaction depends on the specific architecture; empirical evaluation on the target problem is often necessary.
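The two placements look like this in a toy MLP block (layer sizes are arbitrary):

```python
import torch.nn as nn

# Dropout BEFORE batch norm: the norm's batch statistics are computed on
# activations that were already zeroed and rescaled, which can distort
# its centering and scaling
before_norm = nn.Sequential(
    nn.Linear(64, 64), nn.Dropout(0.1), nn.BatchNorm1d(64), nn.ReLU()
)

# Dropout AFTER norm + activation: statistics are computed on the clean
# signal, and the stochasticity is injected afterwards
after_norm = nn.Sequential(
    nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.1)
)
```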

Dropout rates vary significantly across architectures and domains. Vision transformers often use minimal dropout (0.0-0.1) because data augmentation handles regularization. Language models may use 0.1-0.3 for moderate regularization. Very deep networks may use depth-dependent dropout (increasing with depth) to prevent severe feature collapse in later layers. Tuning dropout rate is often overlooked in hyperparameter sweeps but can significantly impact generalization.
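Depth-dependent dropout can be set up at construction time; the linear ramp from 0 to a maximum rate below is one illustrative rule, not a canonical recipe.

```python
import torch.nn as nn

n_layers, max_p = 12, 0.3
blocks = []
for i in range(n_layers):
    p = max_p * i / (n_layers - 1)  # 0.0 at the first block, 0.3 at the last
    blocks += [nn.Linear(64, 64), nn.GELU(), nn.Dropout(p)]
model = nn.Sequential(*blocks)
```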

EXTRA

Dropout in Context and Alternatives

Dropout is one regularization technique among many. Batch normalization acts as a regularizer by adding noise through batch statistics. Data augmentation (random crops, rotations, color jittering) provides implicit regularization. Early stopping halts training when validation performance plateaus. Ensemble methods combine multiple models for better generalization. Modern best practice often combines multiple techniques—dropout alone is rarely sufficient.
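A minimal sketch of combining three of these techniques: dropout in the model, weight decay via AdamW, and early stopping on validation loss. The data is synthetic, and for brevity the "validation" pass reuses the training tensors, which a real setup would never do.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(64, 16), torch.randn(64, 1)

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # weight decay

best_val, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(50):
    model.train()
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

    model.eval()  # dropout off for evaluation
    with torch.no_grad():
        val_loss = nn.functional.mse_loss(model(x), y).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        break  # early stopping: validation stopped improving
```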

For recurrent networks (RNNs, LSTMs), standard dropout on activations causes information loss across timesteps. Variational dropout (applying the same dropout mask across timesteps) preserves information flow while still regularizing. This distinction explains why standard dropout can severely damage RNN performance while variational dropout preserves it. Implementation details like this matter deeply for architectural choices.
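The mask-sharing trick itself can be sketched for activations shaped `(batch, time, features)`; this is an illustration of the idea, not the weight-level variational dropout used inside fused LSTM implementations.

```python
import torch

def variational_dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    # One Bernoulli mask per (batch, feature), broadcast over every timestep,
    # so a dropped feature stays dropped for the whole sequence
    keep = (torch.rand(x.size(0), 1, x.size(2), device=x.device) > p).float()
    return x * keep / (1 - p)

x = torch.ones(2, 5, 4)  # (batch, time, features)
out = variational_dropout(x, p=0.5)
# Every timestep sees the same mask: out[:, t] is identical for all t
```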

Dropout probability schedules (varying p during training) have been explored: starting with no dropout, gradually increasing p during training, then disabling it near convergence. The theoretical justification is that the model needs sufficient signal early in training before regularization becomes helpful. Empirical results are mixed, suggesting that fixed dropout rates tuned via validation sets typically outperform complex schedules.

Theoretical analysis of dropout reveals connections to ensemble methods. Applying dropout is equivalent to training an exponentially large ensemble of networks (with shared weights). Each forward pass with dropout samples from this ensemble. At inference with dropout disabled, the predictions average over the ensemble. This ensemble interpretation explains why dropout improves generalization—ensemble methods have strong generalization guarantees from statistical learning theory.
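This averaging can be checked directly in the special case where dropout is the final operation, since there the eval-mode output equals the ensemble mean exactly in expectation (with nonlinearities after the dropout, the match is only approximate).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(16, 1), nn.Dropout(0.5))
x = torch.randn(8, 16)

# Sample many sub-networks (different masks, shared weights)
layer.train()
with torch.no_grad():
    samples = torch.stack([layer(x) for _ in range(5000)])

# Deterministic eval-mode pass
layer.eval()
with torch.no_grad():
    deterministic = layer(x)

# The mean over sampled sub-networks converges to the eval-mode output
print((samples.mean(0) - deterministic).abs().max().item())  # close to 0
```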

Bernoulli dropout (the standard) randomly zeros activations with fixed probability. Variational dropout (used in RNNs) uses the same dropout mask across timesteps. Concrete dropout treats dropout rates as learnable parameters, optimized via backprop to maximize generalization. Spatial dropout (used in CNNs) zeros entire feature maps instead of individual activations. Choosing the right dropout variant for your architecture and domain matters significantly.

Modern architectures often minimize dropout (sometimes using zero) because data augmentation and other regularization techniques provide sufficient regularization. Vision transformers with strong augmentation (RandAugment, Mixup) often use minimal dropout. Language models rely on dropout but are trending toward lower rates as model capacity and dataset size increase. There's a general trend toward data-centric over regularization-centric approaches.

BOOST

Dropout and Uncertainty Quantification

Monte Carlo dropout uses multiple forward passes with dropout enabled to estimate model uncertainty. By running the same input multiple times with different dropout masks, you get different predictions. The distribution of these predictions estimates confidence. High variance across runs indicates low confidence; low variance indicates high confidence. This technique enables uncertainty quantification without modifying the model, using the regularization mechanism as an uncertainty estimator.

Applications of uncertainty quantification include active learning (querying unlabeled examples where the model is most uncertain), out-of-distribution detection (rejecting samples where uncertainty is anomalously high), and calibration (relating reported confidence to actual accuracy). Dropout-based uncertainty is approximate (not true Bayesian posteriors) but computationally efficient and often effective in practice.
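An active-learning query step with MC dropout might look like the sketch below; the model, the unlabeled pool, and the sample counts are all synthetic placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.3), nn.Linear(16, 1))
unlabeled = torch.randn(100, 8)  # pool of unlabeled examples

model.train()  # deliberately keep dropout active for MC sampling
with torch.no_grad():
    preds = torch.stack([model(unlabeled) for _ in range(30)])  # (30, 100, 1)

uncertainty = preds.std(0).squeeze(-1)    # per-example spread across masks
query_idx = uncertainty.topk(10).indices  # 10 most uncertain: label these next
```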

Comparing dropout to other uncertainty methods (ensemble, temperature scaling, Laplace approximation, full Bayesian) reveals trade-offs. Dropout is cheap and easy to implement; full Bayesian is theoretically correct but computationally expensive. Ensembles are effective but require multiple models. Practitioners should choose uncertainty methods based on computational budget and accuracy requirements, not religious adherence to any single approach.