01 — Fundamentals
The Overfitting Problem
Overfitting occurs when a model memorises training data instead of learning generalizable patterns. The training loss decreases while validation loss increases — a telltale sign. The underlying cause is that models have enough capacity to fit noise in the training set.
Bias-Variance Tradeoff
Every model exhibits bias (underfitting due to insufficient capacity) and variance (overfitting due to fitting noise). Regularisation reduces variance at a small cost to bias. The goal is to find the sweet spot that minimizes test error.
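The tradeoff is easy to see with polynomial regression on noisy data. The sketch below (not from the original; a minimal numpy illustration) fits polynomials of increasing degree to noisy samples of a sine wave: degree 1 underfits (high bias), a moderate degree generalises well, and a very high degree drives training error toward zero while test error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine wave: train and held-out test sets
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 20))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 20)

def mse(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 15):
    train_mse, test_mse = mse(degree)
    print(f"degree={degree:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```

The degree here plays the role of model capacity; regularisation shifts the effective capacity without changing the architecture.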
💡
Regularisation is not always beneficial. On small datasets with limited capacity, adding regularisation helps. On massive datasets with weak priors, regularisation can hurt. Always measure on a held-out test set.
When Regularisation Helps vs Hurts
Regularisation helps when:
- training loss is much lower than validation loss
- the dataset is small relative to model capacity
- you want to enforce prior knowledge

Regularisation hurts when:
- you're underfitting (both losses are high)
- the dataset is enormous
- your prior is wrong (e.g., L2 regularisation on a scale-invariant model)
# Diagnosing overfitting
import matplotlib.pyplot as plt
epochs = list(range(100))
train_losses = [...] # from training loop
val_losses = [...] # from validation
plt.plot(epochs, train_losses, label='Train')
plt.plot(epochs, val_losses, label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Overfitting signature:
# - Train loss: ↓↓↓ (keeps decreasing)
# - Val loss: ↓↓ then ↑↑ (increases after convergence)
02 — Penalty Term
L1, L2, and Weight Decay
Weight decay penalises large weights. L2 weight decay (Ridge regression) adds a penalty term proportional to the sum of squared weights. L1 weight decay (Lasso) penalises the sum of absolute values, encouraging sparsity. Both prevent the model from relying on any single feature too heavily.
L2 vs L1 Comparison
| Method | Penalty Formula | Effect on Weights | Sparsity | Use Case |
| --- | --- | --- | --- | --- |
| L2 (Ridge) | λ × Σ(w²) | Small, distributed | No | Default; smooth regularisation |
| L1 (Lasso) | λ × Σ(\|w\|) | Many zeros, few large | Yes | Feature selection; interpretability |
| Elastic Net | λ₁×Σ(\|w\|) + λ₂×Σ(w²) | Hybrid: zeros + small | Partial | Combined L1+L2 benefits |
Weight Decay in PyTorch
In PyTorch, weight decay is specified via the optimizer. Most optimizers accept a weight_decay parameter, which implements L2 regularisation by adding the decay term to the gradient. Important: weight decay in AdamW is decoupled from this — the decay is applied directly to the weights rather than added to the gradient, so it is not rescaled by Adam's adaptive moments.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 5)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 5, (32,))

# L2 weight decay (Ridge)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Training step
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
# With SGD, weight_decay acts as: w ← w - lr * (grad + weight_decay * w);
# Adam folds the decay term into the gradient before its adaptive scaling

# Manual L2 regularisation (for comparison)
l2_lambda = 1e-4
l2_loss = sum(torch.sum(p ** 2) for p in model.parameters())
total_loss = loss + l2_lambda * l2_loss
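The coupled-vs-decoupled distinction is easiest to see with the data gradient forced to zero, so only the decay acts. This sketch (not from the original) compares Adam and AdamW on a single scalar weight: AdamW shrinks it multiplicatively by exactly (1 − lr·wd) per step, while Adam's coupled decay passes through the adaptive moments and produces a much larger step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# One weight each; the loss gradient is forced to zero, so only decay acts
w_adam = nn.Parameter(torch.tensor([1.0]))
w_adamw = nn.Parameter(torch.tensor([1.0]))
opt_adam = optim.Adam([w_adam], lr=0.1, weight_decay=0.1)
opt_adamw = optim.AdamW([w_adamw], lr=0.1, weight_decay=0.1)

for _ in range(10):
    for opt, w in ((opt_adam, w_adam), (opt_adamw, w_adamw)):
        opt.zero_grad()
        w.grad = torch.zeros_like(w)  # pretend the data gradient is zero
        opt.step()

# Coupled (Adam): decay enters the adaptive moments, so the step is
# roughly lr-sized regardless of |w|. Decoupled (AdamW): each step is
# exactly w ← w * (1 - lr * weight_decay), a gentle shrink.
print(w_adam.item(), w_adamw.item())
```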
Choosing Lambda
The weight decay coefficient λ controls regularisation strength. Too small: regularisation has no effect. Too large: the model underfits. Common practice is to grid search λ ∈ {1e-5, 1e-4, 1e-3, 1e-2} and pick the value with the best validation loss. Start at 1e-4 as a default.
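The grid search described above can be sketched end to end on a synthetic regression problem (a hypothetical setup, not from the original): train one small model per candidate λ from the same initialisation and keep the value with the lowest validation loss.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression: 64 train / 64 validation points, noisy linear target
X = torch.randn(128, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(128, 1)
X_train, y_train, X_val, y_val = X[:64], y[:64], X[64:], y[64:]

def val_loss_for(weight_decay):
    """Train a small model with the given decay; return validation MSE."""
    torch.manual_seed(0)  # identical init for a fair comparison
    model = nn.Linear(10, 1)
    opt = optim.Adam(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(X_val), y_val).item()

results = {wd: val_loss_for(wd) for wd in (1e-5, 1e-4, 1e-3, 1e-2)}
best_wd = min(results, key=results.get)
```

In practice the same loop runs over your real training/validation split, with early stopping per candidate.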
03 — Stochastic Regularisation
Dropout: Random Neuron Deactivation
Dropout randomly zeroes activations during training with probability p. It forces the network to learn redundant representations — no single neuron becomes critical. At inference, all neurons are used; with inverted dropout (the convention PyTorch uses), the surviving activations are scaled up by 1/(1-p) during training, so no rescaling is needed at inference. Dropout is cheap regularisation: no hyperparameter tuning beyond p.
Training vs Inference Mode
This is critical: dropout behaves differently in training and inference. During training, it drops activations randomly. During inference (model.eval()), it's disabled and all activations are used. PyTorch handles this automatically.
import torch
import torch.nn as nn

class DropoutModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(p=0.5)  # drop 50% of activations
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout(x)  # Applied in training; disabled in eval
        x = self.fc2(x)
        return x

model = DropoutModel()

# Training: dropout is active; survivors are scaled up by 1/(1-p)
model.train()
x = torch.randn(32, 128)
y_train = model(x)  # Some activations randomly zeroed

# Inference: dropout is disabled (inverted dropout: no extra scaling)
model.eval()
y_test = model(x)  # All activations used, unscaled
Dropout Rate Selection
Typical dropout rates: 0.2–0.5. Higher rates (e.g., 0.7) for large models or small datasets. Lower rates (0.1–0.2) for small models or large datasets. DropPath (stochastic depth: randomly dropping entire residual branches) is used in vision transformers. Attention dropout is common in LLMs to prevent reliance on specific attention heads.
MC-Dropout for Uncertainty
By keeping dropout active at inference and running multiple forward passes, you get uncertainty estimates. Variations across runs reflect model uncertainty — useful for detecting out-of-distribution inputs.
# MC-Dropout: multiple forward passes with dropout enabled
model.train()  # Keep dropout active (caveat: also puts BatchNorm layers in train mode)
x_test = torch.randn(32, 128)
num_samples = 10
predictions = []
with torch.no_grad():
    for _ in range(num_samples):
        logits = model(x_test)
        predictions.append(logits)
predictions = torch.stack(predictions)  # [num_samples, batch_size, num_classes]
mean_pred = predictions.mean(dim=0)
uncertainty = predictions.std(dim=0)
# High uncertainty suggests an OOD example
04 — Target Modification
Label Smoothing: Soft Targets
Label smoothing replaces hard targets (one-hot vectors) with soft targets. Instead of [0, 1, 0], use [0.05, 0.9, 0.05] (for 3 classes). It prevents the model from becoming overconfident and improves calibration — predicted probabilities better reflect actual accuracy.
Effect on Calibration
Hard targets encourage the model to assign probability 1.0 to the correct class, leading to overconfident predictions. Label smoothing encourages spreading probability mass across all classes, resulting in more calibrated probabilities. This is especially important for uncertainty quantification.
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1, num_classes=10):
    """Compute cross-entropy with label smoothing."""
    # Soft targets: 1-epsilon on the correct class, epsilon/(K-1) on others
    soft_targets = torch.ones_like(logits) * (epsilon / (num_classes - 1))
    soft_targets.scatter_(1, targets.unsqueeze(1), 1 - epsilon)
    # Cross-entropy with soft targets
    log_probs = F.log_softmax(logits, dim=1)
    loss = -(soft_targets * log_probs).sum(dim=1).mean()
    return loss

# Example
logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = label_smoothing_loss(logits, targets, epsilon=0.1)
Typical Epsilon Values
ε = 0.1 is a common default for most tasks. ε = 0.01–0.05 for tasks requiring high confidence (e.g., medical diagnosis). ε = 0.2–0.3 for datasets with noisy labels. The tradeoff: higher ε reduces overfitting but can hurt accuracy if the dataset is clean.
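Recent PyTorch versions (1.10+) support label smoothing natively via nn.CrossEntropyLoss, so the manual implementation is rarely needed. Note the built-in uses a slightly different convention: it spreads ε/K over all K classes, so the correct class receives 1 − ε + ε/K.

```python
import torch
import torch.nn as nn

# Built-in label smoothing (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss = criterion(logits, targets)
```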
05 — Synthetic Training Data
Data Augmentation: Creating Diversity
Data augmentation creates new training examples by applying transformations. For images: rotations, crops, color jitter. For text: back-translation, paraphrasing, word replacement. For both modalities, mixup (interpolating between examples) is powerful. It's free data that improves generalisation.
Image Augmentation
PyTorch's torchvision.transforms provides standard image transforms. Combine multiple transforms in a pipeline. Be careful not to destroy label semantics (e.g., don't flip images where orientation matters, like text).
from torch.utils.data import DataLoader
from torchvision import transforms

# Standard augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Use in DataLoader (MyImageDataset stands in for your own Dataset class)
dataset = MyImageDataset(transform=train_transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
Text Augmentation
For text: back-translation (translate to another language and back), paraphrasing using language models, and synonym replacement. The nlpaug library provides standard techniques.
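Synonym replacement is the simplest of these to sketch. The toy example below (not from the original; the synonym table is hypothetical, and a real pipeline would use WordNet or the nlpaug library) replaces known words with a random synonym with some probability:

```python
import random

# A toy synonym table (hypothetical; use WordNet or nlpaug in practice)
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replace(sentence, p=0.3, rng=random.Random(0)):
    """Replace each known word with a random synonym with probability p."""
    out = []
    for w in sentence.split():
        if w.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

augmented = [synonym_replace("the quick dog is happy") for _ in range(3)]
```

Each call yields a slightly different sentence with the same label, which is the point: more diverse inputs for the same target.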
Mixup: Interpolating Examples
Mixup creates synthetic examples by interpolating between real examples: x' = λx_i + (1-λ)x_j and y' = λy_i + (1-λ)y_j, where λ ~ Beta(α, α). Simple to implement and often improves generalisation by 1–2% accuracy.
# Mixup augmentation
import numpy as np
import torch

def mixup(x, y, alpha=1.0):
    """Return mixed inputs, both label sets, and the mixing weight."""
    batch_size = x.size(0)
    # Sample mixing coefficient from a Beta distribution
    lam = float(np.random.beta(alpha, alpha))
    # Random permutation for mixing
    index = torch.randperm(batch_size)
    # Interpolate inputs; class-index labels can't be interpolated directly,
    # so return both label sets and mix the losses instead
    mixed_x = lam * x + (1 - lam) * x[index, :]
    return mixed_x, y, y[index], lam

# Use in training
for x, y in train_loader:
    x, y_a, y_b, lam = mixup(x, y, alpha=1.0)
    logits = model(x)
    loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
06 — Dynamic Stopping
Early Stopping and Learning Rate Schedules
Early stopping monitors validation loss and stops training when it plateaus or increases. It prevents overfitting without explicit hyperparameter tuning. Learning rate schedules (cosine annealing, warmup, OneCycleLR) reduce the learning rate over time, allowing finer refinement as training progresses. Both are forms of implicit regularisation.
Early Stopping Implementation
Track validation loss at epoch boundaries. Stop if loss hasn't improved for patience consecutive epochs. Save the best checkpoint and load it at the end.
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.should_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

# Use in training loop
early_stopping = EarlyStopping(patience=3)
for epoch in range(num_epochs):
    train_loss = train_one_epoch(...)
    val_loss = evaluate(...)
    early_stopping(val_loss)
    if early_stopping.should_stop:
        print(f"Early stopping at epoch {epoch}")
        break
Learning Rate Schedules
Cosine annealing: learning rate starts high, smoothly decreases to zero. Warmup: small learning rate initially, then increases to peak. OneCycleLR: combines both — ramp up, then ramp down. These schedules improve convergence and reduce overfitting.
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, LinearLR, OneCycleLR
)

optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: LR → 0 over time
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Linear warmup + cosine decay (common in transformers)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=10),
        CosineAnnealingLR(optimizer, T_max=90)
    ],
    milestones=[10]
)

# OneCycleLR (popularised by fast.ai)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=len(train_loader) * num_epochs,
    pct_start=0.3,
    anneal_strategy='cos'
)

# In training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = ...
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # OneCycleLR steps every batch; epoch-based schedulers
                          # like CosineAnnealingLR (T_max in epochs) step per epoch
07 — LLM-Specific Patterns
Regularisation in LLMs
Language models use domain-specific regularisation techniques. Attention dropout and residual dropout prevent overfitting without hurting expressiveness. Weight tying (sharing the input embedding matrix with the output projection) reduces parameters while acting as a regulariser. Z-loss and entropy penalties encourage calibration.
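Weight tying takes one line in PyTorch: point the output projection at the embedding's parameter tensor. The sketch below (illustrative names, not from any particular library; a real model would have transformer blocks between the two layers) shows the idea:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch of weight tying between embedding and output head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share one parameter tensor

    def forward(self, token_ids):
        h = self.embed(token_ids)  # transformer blocks would go here
        return self.lm_head(h)

model = TiedLM()
assert model.lm_head.weight is model.embed.weight  # one tensor, two roles
```

Because the two layers share a single tensor, the parameter count drops by vocab_size × dim, and gradients from both the input and output side update the same matrix.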
Attention and Residual Dropout
In transformer models, dropout is applied to attention weights and residual connections. This differs from dropout in fully-connected layers: it prevents the model from relying on specific attention heads and keeps gradients flowing smoothly through deep networks.
# Typical transformer dropout configuration
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, dropout=dropout)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * dim, dim),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # Self-attention with dropout
        attn_out, _ = self.attention(x, x, x)
        attn_out = self.dropout(attn_out)
        x = x + attn_out  # Residual connection
        x = self.norm1(x)
        # MLP with dropout
        mlp_out = self.mlp(x)
        x = x + mlp_out  # Residual connection
        x = self.norm2(x)
        return x
Z-Loss and Auxiliary Losses
The z-loss (from the PaLM paper) penalises the squared log-partition function — log Z, the logsumexp of the logits — to keep logit magnitudes from drifting upward during training. This improves numerical stability and calibration.
# Z-loss regularisation
import torch
import torch.nn.functional as F

def compute_loss_with_z_loss(logits, targets, z_loss_weight=1e-4):
    """Compute cross-entropy loss with z-loss regularisation (PaLM uses 1e-4)."""
    ce_loss = F.cross_entropy(logits, targets)
    # Z-loss: penalise the squared log-partition function log Z
    log_z = torch.logsumexp(logits, dim=-1)
    z_loss = (log_z ** 2).mean()
    return ce_loss + z_loss_weight * z_loss
Further Reading
References
Research Papers
- Paper Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR. — jmlr.org ↗
- Paper Szegedy, C. et al. (2016). Rethinking the Inception Architecture for Computer Vision. CVPR. Label smoothing introduced. — arxiv:1512.00567 ↗
- Paper Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR. — arxiv:1710.09412 ↗
- Paper Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). JMLR. Includes regularisation techniques for large models. — arxiv:1910.10683 ↗
- Paper Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML. — arxiv:1506.02142 ↗
Practitioner Writing
- Blog Jeremy Howard. Practical Deep Learning for Coders: overfitting and regularisation. fast.ai course. — course.fast.ai ↗