01 — Fundamentals
The Overfitting Problem
Overfitting occurs when a model memorises training data instead of learning generalizable patterns. The training loss decreases while validation loss increases — a telltale sign. The underlying cause is that models have enough capacity to fit noise in the training set.
Bias-Variance Tradeoff
Every model exhibits bias (underfitting due to insufficient capacity) and variance (overfitting due to fitting noise). Regularisation reduces variance at a small cost to bias. The goal is to find the sweet spot that minimizes test error.
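The tradeoff is easy to see with polynomial regression on noisy data. The sketch below (not from the original; a minimal numpy illustration) fits polynomials of increasing degree to noisy samples of a sine wave: degree 1 underfits (high bias), a moderate degree generalises well, and a very high degree drives training error toward zero while test error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine wave: train and held-out test sets
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 20))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 20)

def mse(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 15):
    train_mse, test_mse = mse(degree)
    print(f"degree={degree:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```

The degree here plays the role of model capacity; regularisation shifts the effective capacity without changing the architecture.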
💡
Regularisation is not always beneficial. On small datasets with limited capacity, adding regularisation helps. On massive datasets with weak priors, regularisation can hurt. Always measure on a held-out test set.
When Regularisation Helps vs Hurts
Regularisation helps when:
- training loss is much lower than validation loss
- the dataset is small relative to model capacity
- you want to enforce prior knowledge

Regularisation hurts when:
- you're underfitting (both losses are high)
- the dataset is enormous
- your prior is wrong (e.g., L2 regularisation on a scale-invariant model)
# Diagnosing overfitting
import matplotlib.pyplot as plt
epochs = list(range(100))
train_losses = [...] # from training loop
val_losses = [...] # from validation
plt.plot(epochs, train_losses, label='Train')
plt.plot(epochs, val_losses, label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Overfitting signature:
# - Train loss: ↓↓↓ (keeps decreasing)
# - Val loss: ↓↓ then ↑↑ (increases after convergence)
02 — Penalty Term
L1, L2, and Weight Decay
Weight decay penalises large weights. L2 weight decay (Ridge regression) adds a penalty term proportional to the sum of squared weights. L1 weight decay (Lasso) penalises the sum of absolute values, encouraging sparsity. Both prevent the model from relying on any single feature too heavily.
L2 vs L1 Comparison
| Method | Penalty Formula | Effect on Weights | Sparsity | Use Case |
| --- | --- | --- | --- | --- |
| L2 (Ridge) | λ × Σ(w²) | Small, distributed | No | Default; smooth regularisation |
| L1 (Lasso) | λ × Σ(\|w\|) | Many zeros, few large | Yes | Feature selection; interpretability |
| Elastic Net | λ₁×Σ(\|w\|) + λ₂×Σ(w²) | Hybrid: zeros + small | Partial | Combined L1+L2 benefits |
Weight Decay in PyTorch
In PyTorch, weight decay is specified via the optimizer. Most optimizers accept a weight_decay parameter, which implements L2 regularisation by adding the decay term to the gradient. Important: weight decay in AdamW is decoupled from this — the decay is applied directly to the weights rather than added to the gradient, so it is not rescaled by Adam's adaptive moments.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 5)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 5, (32,))

# L2 weight decay (Ridge)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Training step
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
# With SGD, weight_decay acts as: w ← w - lr * (grad + weight_decay * w);
# Adam folds the decay term into the gradient before its adaptive scaling

# Manual L2 regularisation (for comparison)
l2_lambda = 1e-4
l2_loss = sum(torch.sum(p ** 2) for p in model.parameters())
total_loss = loss + l2_lambda * l2_loss
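The coupled-vs-decoupled distinction is easiest to see with the data gradient forced to zero, so only the decay acts. This sketch (not from the original) compares Adam and AdamW on a single scalar weight: AdamW shrinks it multiplicatively by exactly (1 − lr·wd) per step, while Adam's coupled decay passes through the adaptive moments and produces a much larger step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# One weight each; the loss gradient is forced to zero, so only decay acts
w_adam = nn.Parameter(torch.tensor([1.0]))
w_adamw = nn.Parameter(torch.tensor([1.0]))
opt_adam = optim.Adam([w_adam], lr=0.1, weight_decay=0.1)
opt_adamw = optim.AdamW([w_adamw], lr=0.1, weight_decay=0.1)

for _ in range(10):
    for opt, w in ((opt_adam, w_adam), (opt_adamw, w_adamw)):
        opt.zero_grad()
        w.grad = torch.zeros_like(w)  # pretend the data gradient is zero
        opt.step()

# Coupled (Adam): decay enters the adaptive moments, so the step is
# roughly lr-sized regardless of |w|. Decoupled (AdamW): each step is
# exactly w ← w * (1 - lr * weight_decay), a gentle shrink.
print(w_adam.item(), w_adamw.item())
```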
Choosing Lambda
The weight decay coefficient λ controls regularisation strength. Too small: regularisation has no effect. Too large: the model underfits. Common practice is to grid search λ ∈ {1e-5, 1e-4, 1e-3, 1e-2} and pick the value with the best validation loss. Start at 1e-4 as a default.
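The grid search described above can be sketched end to end on a synthetic regression problem (a hypothetical setup, not from the original): train one small model per candidate λ from the same initialisation and keep the value with the lowest validation loss.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression: 64 train / 64 validation points, noisy linear target
X = torch.randn(128, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(128, 1)
X_train, y_train, X_val, y_val = X[:64], y[:64], X[64:], y[64:]

def val_loss_for(weight_decay):
    """Train a small model with the given decay; return validation MSE."""
    torch.manual_seed(0)  # identical init for a fair comparison
    model = nn.Linear(10, 1)
    opt = optim.Adam(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(X_val), y_val).item()

results = {wd: val_loss_for(wd) for wd in (1e-5, 1e-4, 1e-3, 1e-2)}
best_wd = min(results, key=results.get)
```

In practice the same loop runs over your real training/validation split, with early stopping per candidate.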
03 — Stochastic Regularisation
Dropout: Random Neuron Deactivation
Dropout randomly zeroes activations during training with probability p. It forces the network to learn redundant representations — no single neuron becomes critical. At inference, all neurons are used; with inverted dropout (the convention PyTorch uses), the surviving activations are scaled up by 1/(1-p) during training, so no rescaling is needed at inference. Dropout is cheap regularisation: no hyperparameter tuning beyond p.
Training vs Inference Mode
This is critical: dropout behaves differently in training and inference. During training, it drops activations randomly. During inference (model.eval()), it's disabled and all activations are used. PyTorch handles this automatically.
import torch
import torch.nn as nn

class DropoutModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(p=0.5)  # drop 50% of activations
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)  # Applied in training; disabled in eval
        x = self.fc2(x)
        return x

model = DropoutModel()

# Training: dropout is active; survivors are scaled up by 1/(1-p)
model.train()
x = torch.randn(32, 128)
y_train = model(x)  # Some activations randomly zeroed

# Inference: dropout is disabled (inverted dropout: no extra scaling)
model.eval()
y_test = model(x)  # All activations used, unscaled
Dropout Rate Selection
Typical dropout rates: 0.2–0.5. Higher rates (e.g., 0.7) for large models or small datasets. Lower rates (0.1–0.2) for small models or large datasets. DropPath (stochastic depth: randomly dropping entire residual branches) is used in vision transformers. Attention dropout is common in LLMs to prevent reliance on specific attention heads.
MC-Dropout for Uncertainty
By keeping dropout active at inference and running multiple forward passes, you get uncertainty estimates. Variations across runs reflect model uncertainty — useful for detecting out-of-distribution inputs.
# MC-Dropout: multiple forward passes with dropout enabled
model.train()  # Keep dropout active (caveat: also puts BatchNorm layers in train mode)
x_test = torch.randn(32, 128)
num_samples = 10
predictions = []
with torch.no_grad():
    for _ in range(num_samples):
        logits = model(x_test)
        predictions.append(logits)
predictions = torch.stack(predictions)  # [num_samples, batch_size, num_classes]
mean_pred = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)
# High uncertainty suggests an OOD example
04 — Target Modification
Label Smoothing: Soft Targets
Label smoothing replaces hard targets (one-hot vectors) with soft targets. Instead of [0, 1, 0], use [0.05, 0.9, 0.05] (for 3 classes). It prevents the model from becoming overconfident and improves calibration — predicted probabilities better reflect actual accuracy.
Effect on Calibration
Hard targets encourage the model to assign probability 1.0 to the correct class, leading to overconfident predictions. Label smoothing encourages spreading probability mass across all classes, resulting in more calibrated probabilities. This is especially important for uncertainty quantification.
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1, num_classes=10):
    """Compute cross-entropy with label smoothing."""
    # Soft targets: 1-epsilon on the correct class, epsilon/(K-1) on others
    soft_targets = torch.ones_like(logits) * (epsilon / (num_classes - 1))
    soft_targets.scatter_(1, targets.unsqueeze(1), 1 - epsilon)
    # Cross-entropy with soft targets
    log_probs = F.log_softmax(logits, dim=1)
    loss = -(soft_targets * log_probs).sum(dim=1).mean()
    return loss

# Example
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = label_smoothing_loss(logits, targets, epsilon=0.1)
Typical Epsilon Values
ε = 0.1 is a common default for most tasks. ε = 0.01–0.05 for tasks requiring high confidence (e.g., medical diagnosis). ε = 0.2–0.3 for datasets with noisy labels. The tradeoff: higher ε reduces overfitting but can hurt accuracy if the dataset is clean.
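Recent PyTorch versions (1.10+) support label smoothing natively via nn.CrossEntropyLoss, so the manual implementation is rarely needed. Note the built-in uses a slightly different convention: it spreads ε/K over all K classes, so the correct class receives 1 − ε + ε/K.

```python
import torch
import torch.nn as nn

# Built-in label smoothing (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = criterion(logits, targets)
```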
05 — Synthetic Training Data
Data Augmentation: Creating Diversity
Data augmentation creates new training examples by applying transformations. For images: rotations, crops, color jitter. For text: back-translation, paraphrasing, word replacement. For both modalities, mixup (interpolating between examples) is powerful. It's free data that improves generalisation.
Image Augmentation
PyTorch's torchvision.transforms provides standard image transforms. Combine multiple transforms in a pipeline. Be careful not to destroy label semantics (e.g., don't flip images where orientation matters, like text).
from torch.utils.data import DataLoader
from torchvision import transforms

# Standard augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Use in DataLoader (MyImageDataset stands in for your own Dataset class)
dataset = MyImageDataset(transform=train_transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
Text Augmentation
For text: back-translation (translate to another language and back), paraphrasing using language models, and synonym replacement. The nlpaug library provides standard techniques.
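Synonym replacement is the simplest of these to sketch. The toy example below (not from the original; the synonym table is hypothetical, and a real pipeline would use WordNet or the nlpaug library) replaces known words with a random synonym with some probability:

```python
import random

# A toy synonym table (hypothetical; use WordNet or nlpaug in practice)
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replace(sentence, p=0.3, rng=random.Random(0)):
    """Replace each known word with a random synonym with probability p."""
    out = []
    for w in sentence.split():
        if w.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

augmented = [synonym_replace("the quick dog is happy") for _ in range(3)]
```

Each call yields a slightly different sentence with the same label, which is the point: more diverse inputs for the same target.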
Mixup: Interpolating Examples
Mixup creates synthetic examples by interpolating between real examples: x' = λx_i + (1-λ)x_j and y' = λy_i + (1-λ)y_j, where λ ~ Beta(α, α). Simple to implement and often improves generalisation by 1–2% accuracy.
# Mixup augmentation
import numpy as np
import torch

def mixup(x, y, alpha=1.0):
    """Return mixed inputs, both label sets, and the mixing weight."""
    batch_size = x.size(0)
    # Sample mixing coefficient from a Beta distribution
    lam = float(np.random.beta(alpha, alpha))
    # Random permutation for mixing
    index = torch.randperm(batch_size)
    # Interpolate inputs; class-index labels can't be interpolated directly,
    # so return both label sets and mix the losses instead
    mixed_x = lam * x + (1 - lam) * x[index, :]
    return mixed_x, y, y[index], lam

# Use in training
for x, y in train_loader:
    x, y_a, y_b, lam = mixup(x, y, alpha=1.0)
    logits = model(x)
    loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
06 — Dynamic Stopping
Early Stopping and Learning Rate Schedules
Early stopping monitors validation loss and stops training when it plateaus or increases. It prevents overfitting without explicit hyperparameter tuning. Learning rate schedules (cosine annealing, warmup, OneCycleLR) reduce the learning rate over time, allowing finer refinement as training progresses. Both are forms of implicit regularisation.
Early Stopping Implementation
Track validation loss at epoch boundaries. Stop if loss hasn't improved for patience consecutive epochs. Save the best checkpoint and load it at the end.
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.should_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

# Use in training loop
early_stopping = EarlyStopping(patience=3)
for epoch in range(num_epochs):
    train_loss = train_one_epoch(...)
    val_loss = evaluate(...)
    early_stopping(val_loss)
    if early_stopping.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break
Learning Rate Schedules
Cosine annealing: learning rate starts high, smoothly decreases to zero. Warmup: small learning rate initially, then increases to peak. OneCycleLR: combines both — ramp up, then ramp down. These schedules improve convergence and reduce overfitting.
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, LinearLR, OneCycleLR
)

optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: LR → 0 over time
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Linear warmup + cosine decay (common in transformers)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=10),
        CosineAnnealingLR(optimizer, T_max=90)
    ],
    milestones=[10]
)

# OneCycleLR (popularised by fast.ai)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3,
    anneal_strategy='cos'
)

# In training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = ...
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # OneCycleLR steps every batch; epoch-based schedulers
                          # like CosineAnnealingLR (T_max in epochs) step per epoch
07 — LLM-Specific Patterns
Regularisation in LLMs
Language models use domain-specific regularisation techniques. Attention dropout and residual dropout prevent overfitting without hurting expressiveness. Weight tying (sharing the input embedding matrix with the output projection) reduces parameters while acting as a regulariser. Z-loss and entropy penalties encourage calibration.
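Weight tying takes one line in PyTorch: point the output projection at the embedding's parameter tensor. The sketch below (illustrative names, not from any particular library; a real model would have transformer blocks between the two layers) shows the idea:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch of weight tying between embedding and output head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share one parameter tensor

    def forward(self, token_ids):
        h = self.embed(token_ids)  # transformer blocks would go here
        return self.lm_head(h)

model = TiedLM()
assert model.lm_head.weight is model.embed.weight  # one tensor, two roles
```

Because the two layers share a single tensor, the parameter count drops by vocab_size × dim, and gradients from both the input and output side update the same matrix.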
Attention and Residual Dropout
In transformer models, dropout is applied to attention weights and residual connections. This differs from dropout in fully-connected layers: it prevents the model from relying on specific attention heads and keeps gradients flowing smoothly through deep networks.
# Typical transformer dropout configuration
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, dropout=dropout)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * dim, dim),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # Self-attention with dropout
        attn_out, _ = self.attention(x, x, x)
        attn_out = self.dropout(attn_out)
        x = x + attn_out  # Residual connection
        x = self.norm1(x)
        # MLP with dropout
        mlp_out = self.mlp(x)
        x = x + mlp_out  # Residual connection
        x = self.norm2(x)
        return x
Z-Loss and Auxiliary Losses
The z-loss (from the PaLM paper) penalises the squared log-partition function — log Z, the logsumexp of the logits — to keep logit magnitudes from drifting upward during training. This improves numerical stability and calibration.
# Z-loss regularisation
import torch
import torch.nn.functional as F

def compute_loss_with_z_loss(logits, targets, z_loss_weight=1e-4):
    """Compute cross-entropy loss with z-loss regularisation (PaLM uses 1e-4)."""
    ce_loss = F.cross_entropy(logits, targets)
    # Z-loss: penalise the squared log-partition function log Z
    log_z = torch.logsumexp(logits, dim=-1)
    z_loss = (log_z ** 2).mean()
    return ce_loss + z_loss_weight * z_loss
Further Reading
References
Research Papers
- Paper Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR. — jmlr.org ↗
- Paper Szegedy, C. et al. (2016). Rethinking the Inception Architecture for Computer Vision. CVPR. Label smoothing introduced. — arxiv:1512.00567 ↗
- Paper Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR. — arxiv:1710.09412 ↗
- Paper Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). JMLR. Includes regularisation techniques for large models. — arxiv:1910.10683 ↗
- Paper Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML. — arxiv:1506.02142 ↗
Practitioner Writing
- Blog Jeremy Howard. Practical Deep Learning for Coders: overfitting and regularisation. fast.ai course. — course.fast.ai ↗