Regularization

Weight Decay

L2 regularization that prevents overfitting by penalizing large weights, implemented correctly in AdamW by decoupling the decay from Adam's adaptive learning rate.

λ=0.1 for pre-training
λ=0.01 for fine-tuning
AdamW for the correct implementation


SECTION 01

What Weight Decay Does

Weight decay adds a penalty term (λ/2)‖W‖² to the loss, pushing weights toward zero during training. Small weights = smoother functions = better generalization. It is among the most consistently useful regularizers in deep learning.

# Without weight decay: minimize L(W)
# With weight decay:    minimize L(W) + λ/2 * ||W||²
#
# Gradient: ∂/∂W (L + λ/2 * ||W||²) = ∂L/∂W + λW
# Update:   W ← W - α * (∂L/∂W + λW)
#             = W * (1 - αλ) - α * ∂L/∂W
#
# The (1 - αλ) factor "decays" weights toward zero each step.
# With α=1e-3, λ=0.1: weights shrink by a fraction αλ = 1e-4 per step.
#
# Effect: prevents any individual weight from becoming very large.
# Large weights → model is sensitive to small input changes → overfitting
# Small weights → smooth, generalized decision boundaries
SECTION 02

Adam vs AdamW

# Adam (incorrect weight decay):
#   gradient = ∂L/∂W + λW
#   Adam update uses adaptive scaling: W -= lr * m_hat / (sqrt(v_hat) + ε)
#   For weights with large gradients, v_hat is large → effective decay is small
#   → weight decay is inconsistently applied!
#
# AdamW (correct weight decay):
#   gradient step: W -= lr * m_hat / (sqrt(v_hat) + ε)   [normal Adam step]
#   weight decay:  W -= lr * λ * W                        [separate step]
#   → the decay is ALWAYS lr * λ, regardless of gradient history

import torch

# Always use AdamW:
optimizer_bad  = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.1)   # wrong
optimizer_good = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # correct

# The difference matters most when:
# - using large weight_decay (≥ 0.01)
# - training for many steps
# - parameters have varying gradient magnitudes
SECTION 03

What to Regularize

import torch
import torch.nn as nn

# Not all parameters should have weight decay:
# - Weight matrices (nn.Linear.weight): YES — regularize these
# - Bias terms (nn.Linear.bias): NO — biases shift the function, regularizing harms training
# - LayerNorm weights and biases: NO — they're scale/shift parameters, not capacity
# - Embedding weights: debatable — often NO for pre-trained, sometimes YES for scratch

def get_optimizer(model, lr=1e-4, weight_decay=0.1):
    """AdamW with weight decay only on weight matrices."""
    no_wd = {"bias", "norm", "layernorm", "rmsnorm", "embedding"}
    decay_params, no_decay_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(nd in name.lower() for nd in no_wd):
            no_decay_params.append(param)
        else:
            decay_params.append(param)
    return torch.optim.AdamW([
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ], lr=lr)
SECTION 04

Hyperparameter Tuning

Use case                      | weight_decay | Notes
LLM pre-training from scratch | 0.1          | GPT-3, LLaMA standard
Full fine-tuning              | 0.01–0.1     | Lower than pre-training; model already regularized
LoRA fine-tuning              | 0.0–0.01     | Only ~1M params; WD can hurt learning
Classification head           | 0.01         | New head on frozen backbone
Very small datasets (<1k)     | 0.1–1.0      | Aggressively regularize to prevent memorization
Embedding training            | 0.01         | Standard for contrastive learning
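As a starting point, the table's defaults can be folded into a small lookup helper. The keys, function name, and the single values chosen for the ranges (e.g. 0.05 as the midpoint of 0.01–0.1) are our own, not a library API:

```python
# Hypothetical helper encoding the table's starting points.
WEIGHT_DECAY_DEFAULTS = {
    "pretrain": 0.1,             # GPT-3 / LLaMA standard
    "full_finetune": 0.05,       # midpoint of 0.01–0.1
    "lora": 0.0,                 # few trainable params; decay can hurt
    "classification_head": 0.01,
    "small_dataset": 0.5,        # midpoint of 0.1–1.0, regularize hard
    "embedding": 0.01,           # contrastive-learning standard
}

def suggested_weight_decay(use_case: str) -> float:
    """Return a starting weight_decay for a use case; 0.01 as a safe fallback."""
    return WEIGHT_DECAY_DEFAULTS.get(use_case, 0.01)
```

These are starting points for tuning, not final answers; validate on your own data.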
SECTION 05

Combining with Other Regularization

import torch.nn as nn

# Weight decay + Dropout: complementary
#   weight decay: penalizes large weights (magnitude regularization)
#   dropout:      prevents co-adaptation of neurons (structural regularization)

# For LLM fine-tuning on a small dataset, reuse the get_optimizer helper
# from Section 03 (it already builds AdamW with per-group decay):
optimizer = get_optimizer(model, lr=2e-5, weight_decay=0.1)  # strong L2 regularization

# Add dropout in the classification head
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, 2))

# Don't combine too aggressively:
#   high WD + high dropout + small LR → underfitting
#   Rule: one aggressive regularizer is usually enough

# Data augmentation is often better than tuning WD:
#   LLM augmentation: paraphrase, back-translation, synonym replacement
SECTION 06

Practical Code

from transformers import Trainer, TrainingArguments

# The simplest way — just set weight_decay in TrainingArguments
args = TrainingArguments(
    output_dir="./output",
    weight_decay=0.01,  # applied to all params except bias/norm
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# HuggingFace Trainer automatically excludes bias and norm params

# Diagnosing overfitting:
def monitor_regularization(train_losses, val_losses, window=10):
    """Check if val loss is diverging from train loss."""
    train_smooth = sum(train_losses[-window:]) / window
    val_smooth = sum(val_losses[-window:]) / window
    gap = val_smooth - train_smooth
    if gap > 0.5:
        print(f"Warning: large train/val gap ({gap:.3f}) — "
              "consider increasing weight_decay or dropout")
    elif gap < 0.0:
        print("Val loss below train loss — possible eval data leak, or underfitting")
    return gap
Starting point: AdamW with weight_decay=0.01 for fine-tuning, 0.1 for pre-training. Exclude bias, norm, and embedding params. Monitor the train/val loss gap.
SECTION 07

When Weight Decay Helps vs Hurts

Weight decay acts as a form of regularization by penalizing large weights. However, it's not universally beneficial. Biases (especially in normalization layers) often shouldn't be decayed because they don't contribute to overfitting in the same way as weights do. Modern optimizers like AdamW decouple weight decay from the gradient-based update.
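The decoupling is easy to see numerically. Below is a pure-Python sketch of a single optimizer step using our own simplified update rules (first step only, where the bias-corrected moments collapse to the raw gradient); it is an illustration of the mechanism, not PyTorch's implementation:

```python
import math

def adam_step(w, grad, lr=0.1, wd=0.5, eps=1e-8):
    """One simplified Adam step with coupled (L2-style) weight decay."""
    g = grad + wd * w               # decay folded into the gradient
    m_hat, v_hat = g, g * g         # bias-corrected first-step moments
    # The adaptive rescaling normalizes g, swallowing most of the decay:
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(w, grad, lr=0.1, wd=0.5, eps=1e-8):
    """One simplified AdamW step with decoupled weight decay."""
    w = w * (1 - lr * wd)           # decay applied directly to the weight
    m_hat, v_hat = grad, grad * grad
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Same weight, same gradient: Adam's decay vanishes into the rescaling,
# AdamW's multiplicative shrink (1 - lr*wd = 0.95) survives in full.
print(adam_step(10.0, 20.0))   # ≈ 9.9
print(adamw_step(10.0, 20.0))  # ≈ 9.4
```

The gradient-only part of the update is ≈ lr in both cases; the entire gap between the two results is the decay that Adam failed to apply.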

SECTION 08

L2 vs L1 Regularization vs Weight Decay

In traditional machine learning, L2 regularization (weight decay in SGD) adds a penalty term to the loss. In deep learning with adaptive optimizers like Adam, weight decay as a direct penalty can interact poorly with the adaptive learning rates, which is why AdamW separates them. Understanding this distinction prevents silent performance regressions when switching optimizers.

Weight decay as regularization: Weight decay penalizes the magnitude of weights, pushing them toward zero. In standard SGD with L2 regularization, the update rule becomes w := w - lr * (grad + lambda * w), where lambda is the decay factor. This is mathematically equivalent to adding an L2 penalty (lambda/2) * ||w||² to the loss function. With adaptive optimizers like Adam, however, the interaction between the adaptive rescaling and weight decay becomes subtle — this is why AdamW decouples them.
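For plain SGD, this equivalence can be checked in a couple of lines (the values below are arbitrary):

```python
lr, lam = 0.01, 0.1   # learning rate and decay factor
w, grad = 2.0, 0.5    # current weight and loss gradient

# Route 1: gradient of L + (lam/2) * w² is grad + lam * w
w_penalty = w - lr * (grad + lam * w)

# Route 2: shrink the weight, then take the plain gradient step
w_decay = w * (1 - lr * lam) - lr * grad

print(w_penalty, w_decay)  # both ≈ 1.993 — identical up to float rounding
```

With adaptive optimizers the two routes diverge, because route 1 gets rescaled by the second-moment estimate and route 2 does not.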

Different layer types have different regularization needs. Biases in normalization layers (batch norm, layer norm) rarely benefit from weight decay because they're not part of the feature transformation. Embedding matrices in language models also typically don't use weight decay. Selective weight decay—applying it only to certain parameter groups—often outperforms uniform weight decay across all parameters. PyTorch's parameter groups feature (in optimizers) makes this straightforward.

Hyperparameter tuning for weight decay requires patience. Common values range from 0.01 to 0.1 for AdamW. Too high and the model underfits; too low and overfitting dominates. Combining weight decay with dropout, mixup, or other regularization techniques requires careful balancing to avoid over-regularization. Modern practice often uses relatively high weight decay (0.05-0.1) combined with warmup schedules to stabilize early training.
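A minimal warmup multiplier along these lines might look as follows (our own sketch; with PyTorch you would pass it to torch.optim.lr_scheduler.LambdaLR). Note that because AdamW's per-step decay is lr * λ, warming up the learning rate automatically softens the decay during early training as well:

```python
def warmup_lambda(step, warmup_steps=100):
    """Linear LR warmup multiplier: ramps from ~0 to 1.0, then stays constant."""
    return min(1.0, (step + 1) / warmup_steps)

# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_lambda)
print(warmup_lambda(0), warmup_lambda(99), warmup_lambda(10_000))
```

The warmup length (100 steps here) is illustrative; pick it relative to your total step count.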

EXTRA

Weight Decay in Optimization Theory

From an optimization perspective, weight decay is a form of structural regularization that biases the optimizer toward simpler models (models with smaller weights). In classical machine learning, this regularization is often justified by generalization bounds—simpler models have lower VC dimension and are less likely to overfit. In deep learning, the relationship between weight magnitude and generalization is more nuanced and dataset-dependent.

The "lottery ticket hypothesis" suggests that neural networks contain subnetworks capable of learning to the same accuracy as the full network when trained in isolation. Weight decay influences which tickets are drawn during training by penalizing complex weight patterns. This perspective connects classical regularization to modern insights about neural network structure and trainability.

Adaptive weight decay schedules (reducing decay as training progresses) can outperform fixed decay rates. Early in training, strong decay prevents feature learning; later, it helps refine and compress learned representations. Research on curriculum learning and scheduling provides frameworks for principled choices of decay schedules, though empirical tuning remains standard practice.
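A linear schedule of this kind can be sketched as below (the endpoints and direction are illustrative, not a recommendation; tune them empirically):

```python
def weight_decay_at(step, total_steps, wd_start=0.1, wd_end=0.01):
    """Linearly interpolate weight decay from wd_start to wd_end over training."""
    frac = min(step / total_steps, 1.0)
    return wd_start + frac * (wd_end - wd_start)

# Each step, you would write the value back into the optimizer, e.g.:
# for group in optimizer.param_groups:
#     if group["weight_decay"] > 0:
#         group["weight_decay"] = weight_decay_at(step, total_steps)
```

Mutating `param_groups` between steps is the standard PyTorch mechanism for scheduling any per-group hyperparameter, not just the learning rate.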

Transfer learning interacts with weight decay in non-obvious ways. When fine-tuning a pretrained model, should you use the same weight decay as pretraining? Empirically, reducing weight decay for fine-tuning often works better because the pretrained weights already represent compressed knowledge. Applying strong additional decay could destroy useful features. Selective decay (different rates for different layers) becomes important in transfer scenarios.
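One way to sketch such selective decay is to scale it with layer depth, decaying later layers more and leaving early pretrained layers nearly untouched. The `layers.<idx>.` naming convention and the linear scaling below are assumptions for illustration, not a standard recipe:

```python
def layerwise_weight_decay(param_names, n_layers, wd_max=0.1):
    """Map each parameter name to a depth-scaled weight decay value."""
    decay = {}
    for name in param_names:
        layer = int(name.split(".")[1])  # assumes names like "layers.3.attn.weight"
        decay[name] = wd_max * (layer + 1) / n_layers
    return decay

print(layerwise_weight_decay(
    ["layers.0.attn.weight", "layers.3.mlp.weight"], n_layers=4))
```

The resulting mapping would then feed into per-parameter-group optimizer construction, as in the Section 03 helper.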

Debugging whether weight decay is helping or hurting requires careful ablations. Train the same model with and without weight decay, same learning rate and batch size, and compare validation metrics. Sometimes weight decay hurts! This happens when the task benefits from large weights (e.g., memorization in small datasets) or when regularization is already provided by other mechanisms like dropout. Blind application of common recipes is dangerous.
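The ablation loop described here is simple to skeleton. In the sketch below, `train_and_eval` is a stand-in for your own training-plus-validation routine (returning a validation loss), not a real API:

```python
def ablate_weight_decay(train_and_eval, wd_values=(0.0, 0.01, 0.1), seed=0):
    """Run identical training jobs differing only in weight decay; pick the best."""
    results = {}
    for wd in wd_values:
        # Same seed, same LR, same batch size — only weight_decay varies.
        results[wd] = train_and_eval(weight_decay=wd, seed=seed)
    best = min(results, key=results.get)  # lowest validation loss wins
    return best, results

# Example with a fake evaluation function standing in for a real training run:
fake_eval = lambda weight_decay, seed: (weight_decay - 0.01) ** 2
best_wd, all_results = ablate_weight_decay(fake_eval)
print(best_wd)  # 0.01 minimizes this fake objective
```

In practice you would repeat each run over several seeds and compare means, since single-run differences are often noise.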

In adversarial training and robustness contexts, weight decay interacts with attack generation and defense. Stronger regularization can paradoxically make models less robust to adversarial examples. The relationship between weight magnitude and adversarial robustness remains an active research area. Practitioners should empirically verify that weight decay choices improve robustness metrics, not just clean accuracy.

BOOST

Weight Decay and Model Compression

Weight decay indirectly enables model compression by penalizing weight magnitude. After training with strong weight decay, many weights are small and can be pruned with minimal accuracy loss. This inductive bias toward sparse, compressible models is sometimes intentional—trainers use weight decay explicitly to prepare models for subsequent pruning or quantization. Understanding this connection explains why weight decay improves not just generalization but also model efficiency.

Magnitude pruning (removing smallest-magnitude weights) works well on models trained with weight decay because they naturally develop sparse structure. Models trained without weight decay have more uniform weight magnitudes, making magnitude pruning less effective. This provides another reason to use weight decay: it improves both generalization and compressibility, creating win-win scenarios in practice.
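A toy magnitude-pruning routine (pure Python, our own helper) shows the mechanics; decay-trained models tend to have many near-zero weights, so zeroing the smallest fraction costs little accuracy:

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-|w| fraction of a flat weight list."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])             # indices of the k smallest magnitudes
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

print(magnitude_prune([0.01, -2.0, 0.003, 1.5]))  # → [0.0, -2.0, 0.0, 1.5]
```

Real pruning operates on tensors per layer (e.g. via torch.nn.utils.prune), but the selection rule is the same: keep the largest magnitudes, zero the rest.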

Structured pruning (removing entire neurons or filters) interacts differently with weight decay. When entire structures are pruned, remaining weights must compensate, potentially requiring higher magnitudes. Weight decay in this context is a trade-off—too strong and remaining weights are penalized for compensating; too weak and overfitting dominates. Balancing these objectives requires empirical tuning specific to the pruning strategy and application.