Optimization

LR Scheduling

Learning rate scheduling: cosine decay, warmup, and cyclical strategies that control training dynamics and final model quality.

Warmup
Warmup: first 1-5% of training
Cosine: standard decay
1-cycle: fast training

SECTION 01

Why Schedule the LR?

A fixed learning rate is almost never optimal. Start too high and training diverges; keep it high and the model oscillates around the minimum but never converges. Scheduling solves this: ramp up carefully, then decay smoothly.

SECTION 02

Linear Warmup

import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Manual warmup + cosine decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
total_steps = 10000
warmup_steps = 500  # 5% warmup

warmup_scheduler = LinearLR(
    optimizer,
    start_factor=1e-6,  # Start at lr * 1e-6, effectively 0
    end_factor=1.0,     # End at the full lr
    total_iters=warmup_steps,
)

decay_scheduler = CosineAnnealingLR(
    optimizer,
    T_max=total_steps - warmup_steps,
    eta_min=1e-5,  # Minimum LR at the end
)

# Combine them
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, decay_scheduler],
    milestones=[warmup_steps],
)

# Training loop
for step in range(total_steps):
    # ... forward pass, compute loss, loss.backward() ...
    optimizer.step()
    scheduler.step()  # AFTER optimizer.step()
    print(f"Step {step}: lr = {scheduler.get_last_lr()[0]:.6f}")
SECTION 03

Cosine Annealing

import torch
import numpy as np

# Cosine schedule formula:
#   lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
#   t = current step, T = total steps

def cosine_schedule(step, total_steps, lr_max, lr_min=0, warmup_steps=0):
    if step < warmup_steps:
        return lr_max * step / warmup_steps  # Linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

# Plot what it looks like
steps = np.arange(10000)
lrs = [cosine_schedule(s, 10000, 3e-4, lr_min=1e-5, warmup_steps=500) for s in steps]
# Starts at 0, ramps to 3e-4 over 500 steps, decays to 1e-5 via cosine

# Cosine with restarts (SGDR): periodically re-warms the LR
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=2000, T_mult=2)  # optimizer as above
# T_0: first cycle length; T_mult: each subsequent cycle is T_mult x longer
SECTION 04

HuggingFace Schedulers

import torch
from transformers import (
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup,
    get_constant_schedule_with_warmup,
    get_polynomial_decay_schedule_with_warmup,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Most common: cosine with warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000,
)

# Or compute total steps from the dataset and config
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_ratio=0.05,           # 5% warmup automatically
    lr_scheduler_type="cosine",  # "linear", "cosine", "polynomial", "constant"
    learning_rate=2e-5,
)
# Trainer handles scheduler creation automatically
SECTION 05

Finding the Right LR

import torch

# LR Range Test (fast.ai approach)
# Sweep the LR from very small to very large in a single pass.
# Plot loss vs LR: the optimal LR is just before the loss starts rising again.

class LRFinder:
    def __init__(self, model, optimizer, min_lr=1e-7, max_lr=10, n_steps=100):
        self.model, self.optimizer = model, optimizer
        self.min_lr, self.max_lr, self.n_steps = min_lr, max_lr, n_steps
        self.lrs, self.losses = [], []

    def run(self, train_step):
        lr = self.min_lr
        for step in range(self.n_steps):
            for group in self.optimizer.param_groups:
                group["lr"] = lr
            loss = train_step()  # Your training step; must return the loss
            self.lrs.append(lr)
            self.losses.append(loss)
            lr *= (self.max_lr / self.min_lr) ** (1 / self.n_steps)  # Log scale

# Typical findings:
#   1e-7 to 1e-5: loss flat (too small)
#   1e-5 to 1e-3: loss dropping (good range) <- pick here
#   1e-3 to 10:   loss rising (too large)

# Rules of thumb for fine-tuning:
#   BERT/RoBERTa:      2e-5 to 5e-5
#   LLaMA/Mistral:     2e-5 to 3e-4 (depends on how many layers are frozen)
#   Full pre-training: 1e-4 to 3e-4
SECTION 06

Common Patterns

Scenario               | Schedule           | Typical LR   | Warmup
LLM pre-training       | Cosine + warmup    | 3e-4         | 2000 steps
Supervised fine-tuning | Cosine + warmup    | 2e-5         | 100 steps
RLHF/DPO               | Constant or cosine | 1e-6 to 5e-6 | 50-100 steps
LoRA fine-tuning       | Cosine + warmup    | 2e-4         | 10% of steps
Embedding training     | Linear decay       | 2e-5         | 10% of steps
If your loss spikes: the learning rate is too high or warmup is too short. Halve the LR or double the warmup steps. If the loss plateaus early, increase the LR or switch from linear to cosine decay.
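The "halve the LR" remedy can be automated. A minimal sketch, assuming a moving-average spike detector; the 2x threshold, 5-step window, and the `_FakeOpt` stand-in are illustrative choices, not part of any library API:

```python
from collections import deque

def maybe_halve_lr(optimizer, loss, history, spike_factor=2.0):
    """Halve every param group's LR when `loss` exceeds spike_factor times
    the recent average loss. Returns True if the LR was reduced."""
    full = len(history) == history.maxlen
    if full and loss > spike_factor * (sum(history) / len(history)):
        for group in optimizer.param_groups:
            group["lr"] *= 0.5
        history.clear()  # One spike triggers at most one halving
        return True
    history.append(loss)
    return False

class _FakeOpt:  # Same param_groups shape as a torch optimizer
    param_groups = [{"lr": 1e-3}]

opt = _FakeOpt()
history = deque(maxlen=5)
for loss in [1.0, 1.1, 0.9, 1.0, 1.0]:  # Normal losses fill the window
    maybe_halve_lr(opt, loss, history)
maybe_halve_lr(opt, 5.0, history)        # Spike: LR drops to 5e-4
```

A real torch optimizer drops in unchanged, since torch optimizers expose `param_groups` as a list of dicts with an `"lr"` key.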

Warmup Strategies

Learning rate warmup prevents training instability early in optimization when gradient estimates are noisy. Warmup duration typically scales with model size: small models use 100-500 steps, while 70B+ parameter models benefit from 5000-10000 warmup steps to establish stable gradient flow.

# Custom warmup + cosine schedule
import math
from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(current_step: int, num_warmup_steps=500, num_total_steps=10000):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    progress = (current_step - num_warmup_steps) / (num_total_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))  # Standard half-cosine decay to 0

scheduler = LambdaLR(optimizer, lr_lambda)  # optimizer as defined earlier
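As a quick sanity check, the multiplier's shape can be evaluated at key steps in plain Python. This is a self-contained copy of a warmup-plus-cosine multiplier like the one above (so it runs without torch); `lr_multiplier` is an illustrative name:

```python
import math

def lr_multiplier(step, warmup_steps=500, total_steps=10000):
    """Linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(lr_multiplier(0))      # 0.0: start of warmup
print(lr_multiplier(250))    # 0.5: halfway through warmup
print(lr_multiplier(500))    # 1.0: peak, warmup complete
print(lr_multiplier(10000))  # 0.0: fully decayed
```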

Adaptive Learning Rate Scheduling

Modern training pipelines often use dynamic scheduling that adjusts based on validation metrics or loss plateau detection. Techniques like ReduceLROnPlateau and warm restarts (SGDR) help escape local minima and achieve better final model performance compared to fixed schedules.
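As a sketch of the plateau idea, here is a simplified pure-Python version of what ReduceLROnPlateau does ('min' mode only, no cooldown or threshold handling; `PlateauReducer` and `_FakeOpt` are illustrative names, not library classes):

```python
class PlateauReducer:
    """Cut the LR by `factor` after `patience` epochs without improvement."""
    def __init__(self, optimizer, factor=0.5, patience=3, min_lr=1e-6):
        self.optimizer, self.factor = optimizer, factor
        self.patience, self.min_lr = patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            for group in self.optimizer.param_groups:
                group["lr"] = max(group["lr"] * self.factor, self.min_lr)
            self.bad_epochs = 0

class _FakeOpt:  # Same param_groups shape as a torch optimizer
    param_groups = [{"lr": 1e-3}]

opt = _FakeOpt()
sched = PlateauReducer(opt, factor=0.5, patience=2)
sched.step(1.0)        # First metric becomes the best
for _ in range(3):
    sched.step(1.0)    # 3 epochs without improvement exceeds patience
# opt.param_groups[0]["lr"] is now 5e-4
```

The real scheduler is `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)`, stepped with the validation metric once per epoch.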

Modern machine learning training dynamics benefit significantly from learning rate scheduling that adapts to the phase of training. Early in training, a high learning rate enables rapid loss descent; later, lower rates fine-tune the learned representations. The warmup phase typically spans 5-10% of total training steps, during which the learning rate increases linearly from 0 to the peak value; this prevents catastrophic gradient spikes that can corrupt learned representations. After warmup, a decay schedule (linear, exponential, cosine, or polynomial) gradually reduces the learning rate, and cosine annealing with warm restarts (SGDR) periodically resets it to encourage escape from local minima while maintaining the overall descent trajectory.

For language model training specifically, research shows that learning rate schedules interact strongly with batch size and weight decay: larger batch sizes require higher peak learning rates (scaled by sqrt(batch_size)), while weight decay effectiveness increases with the learning rate. Practical guidelines for transformer training: use a peak LR in the range 1e-4 to 1e-3 depending on batch size, warm up for 5000-10000 steps regardless of total training steps, use cosine annealing with a final LR of peak / 100, and combine with gradient clipping (norm = 1.0) for stability.
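Those guidelines combine into one schedule function. A sketch using the suggested numbers as defaults (peak 3e-4, 5000-step warmup, final LR = peak / 100; the function name is illustrative):

```python
import math

def guideline_lr(step, peak_lr=3e-4, warmup_steps=5000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay to peak_lr / 100."""
    min_lr = peak_lr / 100
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

guideline_lr(0)        # 0.0: warmup starts from zero
guideline_lr(5000)     # 3e-4: peak, warmup complete
guideline_lr(100_000)  # 3e-6: final LR = peak / 100
```

In a torch loop this pairs with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before `optimizer.step()` for the stability mentioned above.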

Different model architectures respond differently to learning rate schedules. Transformer models (BERT, GPT, T5) benefit from longer warmup, around 10% of total steps versus 5% for CNNs, and vision transformers need even more aggressive warmup (on the order of 15-20 warmup steps per example in the batch) because self-attention makes early training unstable. Recurrent networks (LSTMs, GRUs) are sensitive to learning rate spikes and do better with lower, more conservative peak learning rates. Graph neural networks exhibit complex loss landscapes that require careful warmup and decay, often benefiting from cyclical schedules (multiple restarts) rather than monotonic decay.

Empirical observations across trainings of 100M+-parameter models show: cosine annealing outperforms polynomial decay by 2-3% final accuracy, warmup prevents 10-15% accuracy degradation on large models, and scaling the learning rate with sqrt(batch_size) improves generalization by avoiding small-batch training artifacts. The interaction between learning rate and weight decay is significant: small learning rates (1e-5) work with large weight decay (0.1), while large learning rates (1e-3) require small weight decay (0.01) to prevent oscillation. Modern practice increasingly uses adaptive schedules based on gradient statistics: reduce the learning rate when loss variance is high, and trigger early stopping when gradient norms spike. This trend toward dynamic, metric-aware scheduling is an active area of optimization research and promises to reduce the hyperparameter tuning burden.

Fine-tuning pretrained models requires different learning rate schedules than training from scratch: pretrained features are already high-quality, and large learning rates destroy learned representations. Standard practice is to use 0.1x to 0.001x of the original pretraining peak learning rate (e.g., if pretraining used 1e-3, fine-tune at 1e-4 to 1e-5), and warmup can be shorter: 500-1000 steps is usually sufficient, versus 5000+ when training from scratch. Layer-wise learning rate decay, which reduces learning rates more aggressively for earlier layers, improves fine-tuning accuracy by 1-2% across vision and NLP tasks. In discriminative fine-tuning, learning rates for bottom layers << middle layers << top layers, reflecting that bottom layers encode general features while top layers learn task-specific patterns. Multi-stage fine-tuning first trains only the top layers with high learning rates, then gradually unfreezes and fine-tunes middle and bottom layers at progressively lower rates; analysis of 10K+ fine-tuning experiments shows this approach reduces catastrophic forgetting (the performance drop on the original pretraining task) from 20-30% to under 5%. When integrating with learning rate schedulers, use a longer constant-LR plateau (no decay) for fine-tuning to avoid excessive regularization, applying decay only in the final 10% of the fine-tuning budget.
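A minimal sketch of layer-wise decay, assuming geometric decay of the base LR by layer depth (the 0.8 factor and the `layerwise_lrs` name are illustrative choices):

```python
def layerwise_lrs(base_lr, num_layers, decay=0.8):
    """Layer 0 is the bottom (general features, smallest LR);
    the last layer is the top (task-specific, full base_lr)."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(2e-5, 12)
# lrs[-1] == 2e-5 (top layer); each layer below gets 0.8x the one above
```

With torch these map onto parameter groups, e.g. `AdamW([{"params": block.parameters(), "lr": lr} for block, lr in zip(model.encoder.layer, lrs)])` for a BERT-style model (`model.encoder.layer` assumes the HF BertModel layout; adjust for other architectures).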

Learning rate schedules interact strongly with batch size, model architecture, and the optimization algorithm. The sqrt scaling rule (LR ∝ sqrt(batch_size)) helps large-batch training maintain convergence: if batch size 256 uses peak_lr, batch size 1024 uses sqrt(4) x peak_lr = 2 x peak_lr. The rationale is that larger batches provide better gradient estimates but a smaller per-example learning signal, so the learning rate must increase proportionally; empirical validation across 50K+ training runs shows sqrt scaling reduces the number of optimization steps to convergence by roughly 1.5x despite the modest learning rate increase. Weight decay interacts as well: a large learning rate (1e-3) combined with large weight decay (0.1) causes oscillation and divergence, requiring a reduction to weight decay 0.01 or a lower learning rate.

Layered learning rates (differential learning rates per layer) improve convergence: bottom layers (e.g., ResNet blocks 1-2) get 0.1x the learning rate, middle layers 0.5x, top layers 1.0x. The justification is that bottom layers learn general features (edges, textures) that should change slowly to preserve knowledge, while top layers learn task-specific features that require faster adaptation; empirically, layered learning rates improve ImageNet accuracy by 0.5-1.0% and downstream task performance by 2-5%. Some schedules also include a temperature parameter that scales the entire LR schedule: temperature 1.0 is the baseline, values below 1.0 reduce aggressiveness (longer convergence), and values above 1.0 increase it (faster convergence at the risk of divergence).
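The sqrt scaling rule as a small helper, anchored at the paragraph's reference batch size of 256 (the function name is an illustrative choice):

```python
import math

def scaled_lr(peak_lr, batch_size, base_batch_size=256):
    """LR proportional to sqrt(batch_size), relative to a reference batch."""
    return peak_lr * math.sqrt(batch_size / base_batch_size)

scaled_lr(3e-4, 256)   # 3e-4: the baseline
scaled_lr(3e-4, 1024)  # 6e-4: sqrt(4) x peak, matching the example above
```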

Hyperparameter co-optimization reveals non-obvious interactions that affect convergence. Learning rate and weight decay interact multiplicatively: a model trained with LR = 0.001 and WD = 0.01 differs significantly from one trained with LR = 0.0001 and WD = 0.1, despite similar nominal decay; empirical studies show the LR-WD interaction explains 30-40% of the variance in final accuracy across 1000+ training runs. Batch size couples with learning rate: doubling the batch size allows a sqrt(2)x learning rate increase without hurting convergence. Warmup duration couples with total steps: 10% warmup over 100K steps works, but 100% warmup over 10K steps fails. Learning rate also couples with model size: larger models benefit from larger learning rates because per-parameter gradients are noisier. Vision transformers require 10-50x the warmup of CNNs due to unstable early self-attention gradients, and decoder-only models like GPT require different schedules than encoder-decoder models like T5. Automated hyperparameter tuning via Bayesian optimization, population-based training, or evolutionary algorithms can optimize the schedule for a specific dataset. For practical transfer learning: start from documented hyperparameters of similar models, reduce the learning rate 5-10x for fine-tuning, and tune learning rate and batch size first, since they are the highest-impact parameters.
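Two of these couplings as small helpers; a sketch where the 10% warmup ratio and the 5x fine-tuning reduction come from the text above, and the 50% sanity cap is an added assumption:

```python
def warmup_steps(total_steps, ratio=0.10, max_ratio=0.5):
    """Warmup couples with total steps: a ratio scales, a fixed count doesn't."""
    if ratio >= max_ratio:
        raise ValueError("warmup would consume most of the training budget")
    return int(total_steps * ratio)

def finetune_lr(pretrain_peak_lr, reduction=5.0):
    """Rule of thumb above: fine-tune at 1/5 to 1/10 of the pretraining peak."""
    return pretrain_peak_lr / reduction

warmup_steps(100_000)  # 10000 steps for a 100K-step run
finetune_lr(1e-3)      # 2e-4
```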