Foundations · Training

Regularisation Techniques

Dropout, weight decay, label smoothing, data augmentation, and early stopping — preventing overfitting in deep learning

6 Techniques
7 Sections
Python-first Code Examples
In this guide
  1. Overfitting Problem
  2. Weight Decay
  3. Dropout
  4. Label Smoothing
  5. Data Augmentation
  6. Early Stopping
  7. LLM Regularisation
01 — Fundamentals

The Overfitting Problem

Overfitting occurs when a model memorises training data instead of learning generalisable patterns. The telltale signature: training loss keeps decreasing while validation loss starts increasing. The underlying cause is that the model has enough capacity to fit noise in the training set.

Bias-Variance Tradeoff

Every model trades off bias (underfitting due to insufficient capacity) against variance (overfitting due to fitting noise). Regularisation reduces variance at a small cost in bias. The goal is to find the sweet spot that minimises test error.
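The tradeoff can be made concrete with a toy experiment (a sketch using NumPy; the degrees and noise level are illustrative): fitting polynomials of increasing degree to noisy samples of a quadratic shows the underfit and overfit regimes directly in the train and test errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic: a constant fit underfits (high bias),
# a degree-15 fit chases the noise (high variance).
x = np.linspace(-1, 1, 30)
y = x ** 2 + rng.normal(scale=0.1, size=x.shape)
x_test = np.linspace(-1, 1, 101)
y_test = x_test ** 2  # noiseless ground truth

results = {}
for degree in (0, 2, 15):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Higher degree always lowers the training error; only the right capacity lowers the test error.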

💡 Regularisation is not always beneficial. On small datasets with limited capacity, adding regularisation helps. On massive datasets with weak priors, regularisation can hurt. Always measure on a held-out test set.

When Regularisation Helps vs Hurts

Regularisation helps when: training loss is much lower than validation loss, you have a small dataset relative to model capacity, or you want to enforce prior knowledge. Regularisation hurts when: you're underfitting (both losses are high), your dataset is enormous, or your prior is wrong (e.g., L2 regularisation on a scale-invariant model).

# Diagnosing overfitting
import matplotlib.pyplot as plt

epochs = list(range(100))
train_losses = [...]  # from training loop
val_losses = [...]    # from validation

plt.plot(epochs, train_losses, label='Train')
plt.plot(epochs, val_losses, label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Overfitting signature:
# - Train loss: keeps decreasing
# - Val loss: decreases, then increases past the best epoch
02 — Penalty Term

L1, L2, and Weight Decay

Weight decay penalises large weights. L2 weight decay (Ridge regression) adds a penalty term proportional to the sum of squared weights. L1 weight decay (Lasso) penalises the sum of absolute values, encouraging sparsity. Both prevent the model from relying on any single feature too heavily.

L2 vs L1 Comparison

Method      | Penalty Formula       | Effect on Weights     | Sparsity | Use Case
L2 (Ridge)  | λ × Σ(w²)             | Small, distributed    | No       | Default; smooth regularisation
L1 (Lasso)  | λ × Σ(|w|)            | Many zeros, few large | Yes      | Feature selection; interpretability
Elastic Net | λ₁×Σ(|w|) + λ₂×Σ(w²)  | Hybrid: zeros + small | Partial  | Combined L1+L2 benefits

Weight Decay in PyTorch

In PyTorch, weight decay is specified via the optimizer. Most optimizers accept a weight_decay parameter; in classical optimizers (SGD, Adam) it implements L2 regularisation by adding weight_decay × w to the gradient. Important: AdamW uses decoupled weight decay instead, applying the decay directly to the weights rather than through the gradient (and hence through Adam's moment estimates), which usually generalises better in practice.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 5)
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 5)

# L2 weight decay (Ridge)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Training step
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# weight_decay is applied: w ← w - lr * (grad + weight_decay * w)

# Manual L2 regularisation (for comparison)
l2_lambda = 1e-4
l2_loss = sum(torch.sum(p ** 2) for p in model.parameters())
total_loss = loss + l2_lambda * l2_loss
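Decoupled weight decay pairs naturally with optimizer parameter groups. A common pattern, sketched below (the "2-D vs 1-D" split is a heuristic, and the model is illustrative), is to exclude biases and normalisation parameters from decay:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 32), nn.LayerNorm(32), nn.Linear(32, 5))

# Heuristic split: decay matrix weights only; biases and LayerNorm
# parameters (all 1-D) are conventionally left undecayed.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]

optimizer = optim.AdamW(
    [{"params": decay, "weight_decay": 1e-2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)

# One training step to exercise both groups
x = torch.randn(8, 10)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```

Decaying normalisation scales or biases rarely helps and can hurt, which is why most transformer training recipes use this split.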

Choosing Lambda

The weight decay coefficient λ controls regularisation strength. Too small: regularisation has no effect. Too large: the model underfits. Common practice is to grid search λ ∈ {1e-5, 1e-4, 1e-3, 1e-2} and pick the value with the best validation loss. Start at 1e-4 as a default.
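A minimal sketch of that grid search, with a hypothetical val_loss_for helper and synthetic regression data standing in for a real training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

def val_loss_for(weight_decay, train_data, val_data, epochs=50):
    """Hypothetical helper: train a small model with one weight decay
    setting and report the resulting validation loss."""
    torch.manual_seed(0)  # identical init for every candidate
    model = nn.Linear(10, 1)
    opt = optim.Adam(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    x_tr, y_tr = train_data
    for _ in range(epochs):
        opt.zero_grad()
        F.mse_loss(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        x_va, y_va = val_data
        return F.mse_loss(model(x_va), y_va).item()

# Synthetic regression data, split into train and validation halves
torch.manual_seed(0)
x = torch.randn(200, 10)
y = x[:, :1] + 0.1 * torch.randn(200, 1)
train_data, val_data = (x[:100], y[:100]), (x[100:], y[100:])

# Grid over the values suggested above
scores = {wd: val_loss_for(wd, train_data, val_data)
          for wd in (1e-5, 1e-4, 1e-3, 1e-2)}
best_wd = min(scores, key=scores.get)
```

Fixing the seed across candidates keeps the comparison fair; in practice you would average over several seeds.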

03 — Stochastic Regularisation

Dropout: Random Neuron Deactivation

Dropout randomly zeroes activations during training with probability p. It forces the network to learn redundant representations — no single neuron becomes critical. In classical dropout, all neurons are used at inference but scaled by (1-p) to match the expected training-time activation. Modern frameworks, including PyTorch, use inverted dropout instead: surviving activations are scaled up by 1/(1-p) during training, so inference needs no scaling at all. Dropout is cheap regularisation: the only hyperparameter is p.

Training vs Inference Mode

This is critical: dropout behaves differently in training and inference. During training, it drops activations randomly. During inference (model.eval()), it's disabled and all activations are used. PyTorch handles this automatically.

import torch
import torch.nn as nn

class DropoutModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(p=0.5)  # drop 50% of activations
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)  # applied in training; disabled in eval
        x = self.fc2(x)
        return x

model = DropoutModel()

# Training: dropout is active
model.train()
x = torch.randn(32, 128)
y_train = model(x)  # some activations zeroed; survivors scaled by 1/(1-p)

# Inference: dropout is disabled
model.eval()
y_test = model(x)  # all activations used; no scaling (inverted dropout)

Dropout Rate Selection

Typical dropout rates: 0.2–0.5. Higher rates (e.g., 0.7) for large models or small datasets. Lower rates (0.1–0.2) for small models or large datasets. DropPath (stochastic depth — randomly dropping entire residual branches rather than individual neurons) is used in vision transformers. Attention dropout is common in LLMs to prevent reliance on specific attention heads.

MC-Dropout for Uncertainty

By keeping dropout active at inference and running multiple forward passes, you get uncertainty estimates. Variations across runs reflect model uncertainty — useful for detecting out-of-distribution inputs.

# MC-Dropout: multiple forward passes with dropout enabled
model.train()  # keep dropout active
x_test = torch.randn(16, 128)  # held-out batch for the model above
num_samples = 10
predictions = []
with torch.no_grad():
    for _ in range(num_samples):
        logits = model(x_test)
        predictions.append(logits)

predictions = torch.stack(predictions)  # [num_samples, batch_size, num_classes]
mean_pred = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)  # high uncertainty suggests an OOD example
04 — Target Modification

Label Smoothing: Soft Targets

Label smoothing replaces hard targets (one-hot vectors) with soft targets. Instead of [0, 1, 0], use [0.05, 0.9, 0.05] (for 3 classes). It prevents the model from becoming overconfident and improves calibration — predicted probabilities better reflect actual accuracy.

Effect on Calibration

Hard targets encourage the model to assign probability 1.0 to the correct class, leading to overconfident predictions. Label smoothing encourages spreading probability mass across all classes, resulting in more calibrated probabilities. This is especially important for uncertainty quantification.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1, num_classes=10):
    """Compute cross-entropy with label smoothing."""
    # Soft targets: 1 - epsilon on the correct class, epsilon/(K-1) on the others
    soft_targets = torch.ones_like(logits) * (epsilon / (num_classes - 1))
    soft_targets.scatter_(1, targets.unsqueeze(1), 1 - epsilon)

    # Cross-entropy against the soft targets
    log_probs = F.log_softmax(logits, dim=1)
    loss = -(soft_targets * log_probs).sum(dim=1).mean()
    return loss

# Example
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = label_smoothing_loss(logits, targets, epsilon=0.1)

Typical Epsilon Values

ε = 0.1 is a common default for most tasks. ε = 0.01–0.05 for tasks requiring high confidence (e.g., medical diagnosis). ε = 0.2–0.3 for datasets with noisy labels. The tradeoff: higher ε reduces overfitting but can hurt accuracy if the dataset is clean.
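In practice you rarely need a manual implementation: since version 1.10, PyTorch's nn.CrossEntropyLoss accepts a label_smoothing argument directly. Note that the built-in spreads ε over all K classes (the correct class receives 1 - ε + ε/K), a slightly different convention from the ε/(K-1) variant shown earlier:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = criterion(logits, targets)

# label_smoothing=0.0 recovers ordinary cross-entropy exactly
hard = nn.CrossEntropyLoss(label_smoothing=0.0)(logits, targets)
plain = nn.CrossEntropyLoss()(logits, targets)
```

For most models the two conventions are indistinguishable in practice; pick one and keep it consistent when comparing runs.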

05 — Synthetic Training Data

Data Augmentation: Creating Diversity

Data augmentation creates new training examples by applying label-preserving transformations. For images: rotations, crops, color jitter. For text: back-translation, paraphrasing, word replacement. For both modalities, mixup (interpolating between examples) is powerful. It's effectively free data that improves generalisation.

Image Augmentation

PyTorch's torchvision.transforms provides standard image transforms. Combine multiple transforms in a pipeline. Be careful not to destroy label semantics (e.g., don't flip images where orientation matters, like text).

from torch.utils.data import DataLoader
from torchvision import transforms

# Standard augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Use in a DataLoader (MyImageDataset is your dataset class)
dataset = MyImageDataset(transform=train_transform)
loader = DataLoader(dataset, batch_size=32)

Text Augmentation

For text: back-translation (translate to another language and back), paraphrasing using language models, and synonym replacement. The nlpaug library provides standard techniques.
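To give a flavour of the simplest of these, here is a self-contained synonym-replacement sketch; the tiny SYNONYMS table is a stand-in for a real lexical resource such as WordNet, which libraries like nlpaug draw on:

```python
import random

# Toy synonym table — a stand-in for a real resource like WordNet
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replace(sentence, p=0.5, rng=None):
    """Replace each word that has a known synonym with probability p."""
    rng = rng or random.Random(0)  # fixed seed by default, for reproducibility
    out = []
    for word in sentence.split():
        if word.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word.lower()]))
        else:
            out.append(word)
    return " ".join(out)

augmented = synonym_replace("the quick dog was happy", p=1.0)
```

Each call yields a slightly different training sentence with (ideally) unchanged meaning; the usual caveat is that naive replacement can shift semantics, which is why back-translation is often preferred.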

Mixup: Interpolating Examples

Mixup creates synthetic examples by interpolating between real examples: x' = λx_i + (1-λ)x_j and y' = λy_i + (1-λ)y_j, where λ ~ Beta(α, α). The label interpolation assumes one-hot or soft targets; with integer class labels, the two per-example losses are mixed instead. Simple to implement, and it often improves accuracy by 1–2 points.

# Mixup augmentation
import numpy as np
import torch

def mixup(x, y, alpha=1.0):
    """Apply mixup to a batch; returns mixed inputs plus both label sets."""
    batch_size = x.size(0)
    # Mixing coefficient sampled from a Beta distribution (a Python float)
    lam = float(np.random.beta(alpha, alpha))
    # Random permutation for pairing examples
    index = torch.randperm(batch_size)
    # Interpolate inputs; integer labels are mixed in the loss instead
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

# Use in training
for x, y in train_loader:
    x, y_a, y_b, lam = mixup(x, y, alpha=1.0)
    logits = model(x)
    loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
06 — Dynamic Stopping

Early Stopping and Learning Rate Schedules

Early stopping monitors validation loss and stops training when it plateaus or increases. It prevents overfitting without explicit hyperparameter tuning. Learning rate schedules (cosine annealing, warmup, OneCycleLR) reduce the learning rate over time, allowing finer refinement as training progresses. Both are forms of implicit regularisation.

Early Stopping Implementation

Track validation loss at epoch boundaries. Stop if loss hasn't improved for patience consecutive epochs. Save the best checkpoint and load it at the end.

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.should_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

# Use in the training loop
early_stopping = EarlyStopping(patience=3)
for epoch in range(num_epochs):
    train_loss = train_one_epoch(...)
    val_loss = evaluate(...)
    early_stopping(val_loss)
    if early_stopping.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break

Learning Rate Schedules

Cosine annealing: learning rate starts high, smoothly decreases to zero. Warmup: small learning rate initially, then increases to peak. OneCycleLR: combines both — ramp up, then ramp down. These schedules improve convergence and reduce overfitting.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, OneCycleLR

optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: LR decays smoothly towards zero
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Linear warmup + cosine decay (common in transformers)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=10),
        CosineAnnealingLR(optimizer, T_max=90)
    ],
    milestones=[10]
)

# OneCycleLR (popularised by fastai)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3,
    anneal_strategy='cos'
)

# In the training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = ...
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # OneCycleLR steps every batch; epoch-based
                          # schedulers (e.g. CosineAnnealingLR) step per epoch
07 — LLM-Specific Patterns

Regularisation in LLMs

Language models use domain-specific regularisation techniques. Attention dropout and residual dropout prevent overfitting without hurting expressiveness. Weight tying (classically, sharing the input embedding matrix with the output projection; some models such as ALBERT also share weights across layers) cuts parameter count while acting as a regulariser. Z-loss and entropy penalties encourage calibration.
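The classic form of weight tying takes only one line; the TinyLM class and its sizes below are illustrative:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy LM head illustrating input/output weight tying."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one shared [vocab, dim] matrix

    def forward(self, tokens):
        return self.lm_head(self.embed(tokens))

model = TinyLM()
tokens = torch.randint(0, 100, (2, 5))
logits = model(tokens)  # shape [2, 5, 100]
```

The embedding and the output projection are literally the same parameter, so gradients from both roles update one matrix; this halves the (often large) vocabulary-related parameter count.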

Attention and Residual Dropout

In transformer models, dropout is applied to the attention weights and to the outputs feeding the residual connections. Unlike dropout in plain fully-connected stacks, this targets transformer-specific failure modes: it prevents the model from over-relying on particular attention heads and keeps gradients flowing smoothly through deep residual networks.

# Typical transformer dropout configuration
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, dropout=dropout)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * dim, dim),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # Self-attention with dropout
        attn_out, _ = self.attention(x, x, x)
        attn_out = self.dropout(attn_out)
        x = x + attn_out  # residual connection
        x = self.norm1(x)

        # MLP with dropout
        mlp_out = self.mlp(x)
        x = x + mlp_out  # residual connection
        x = self.norm2(x)
        return x

Z-Loss and Auxiliary Losses

The z-loss (introduced in the PaLM paper) penalises the squared log-partition function, log Z = logsumexp(logits), discouraging the logits from drifting to large magnitudes. This improves training stability and calibration.

# Z-loss regularisation
def compute_loss_with_z_loss(logits, targets, z_loss_weight=1e-4):
    """Cross-entropy loss plus z-loss regularisation (PaLM used weight 1e-4)."""
    ce_loss = F.cross_entropy(logits, targets)

    # Z-loss: penalise large values of log Z = logsumexp(logits)
    log_z = torch.logsumexp(logits, dim=-1)
    z_loss = (log_z ** 2).mean()

    return ce_loss + z_loss_weight * z_loss
Tools & Libraries

Regularisation Tools and Libraries

Framework · PyTorch: Built-in Dropout, weight_decay in optimizers, and learning rate schedulers. The foundation for every technique in this guide.
Augmentation · torchvision: Standard image transforms: RandomCrop, RandomRotation, ColorJitter, Resize. Easy composition with Compose().
Augmentation · albumentations: Fast, flexible image augmentation built on OpenCV; better suited than torchvision to complex pipelines.
Augmentation · nlpaug: Text augmentation: back-translation, word replacement, synonym substitution. Easy integration with training pipelines.
Hyperparameter search · Optuna: Hyperparameter optimisation. Automatically searches for the best weight_decay, dropout rate, and learning rate schedule.
Training · Lightning: High-level training framework with built-in early stopping, learning rate schedulers, and mixed precision.