Learning rate scheduling: cosine decay, warmup, and cyclical strategies that control training dynamics and final model quality.
A fixed learning rate is almost never optimal. Start too high and training diverges; keep it high and the model oscillates around the minimum but never converges. Scheduling solves this: ramp up carefully, then decay smoothly.
| Scenario | Schedule | Typical LR | Warmup |
|---|---|---|---|
| LLM pre-training | Cosine + warmup | 3e-4 | 2000 steps |
| Supervised fine-tuning | Cosine + warmup | 2e-5 | 100 steps |
| RLHF/DPO | Constant or cosine | 1e-6 to 5e-6 | 50-100 steps |
| LoRA fine-tuning | Cosine + warmup | 2e-4 | 10% of steps |
| Embedding training | Linear decay | 2e-5 | 10% |
Learning rate warmup prevents training instability early in optimization when gradient estimates are noisy. Warmup duration typically scales with model size: small models use 100-500 steps, while 70B+ parameter models benefit from 5000-10000 warmup steps to establish stable gradient flow.
```python
# Custom warmup + cosine schedule
import math

from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(current_step: int, num_warmup_steps: int = 500, num_total_steps: int = 10000) -> float:
    # Linear warmup: scale the LR from 0 up to the peak over num_warmup_steps.
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (current_step - num_warmup_steps) / max(1, num_total_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

scheduler = LambdaLR(optimizer, lr_lambda)
```
Modern training pipelines often use dynamic scheduling that adjusts based on validation metrics or loss plateau detection. Techniques like ReduceLROnPlateau and warm restarts (SGDR) help escape local minima and achieve better final model performance compared to fixed schedules.
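The warm-restart idea behind SGDR can be sketched in plain Python (the function name, cycle length, and `t_mult` doubling factor here are illustrative choices, not a specific library's API):

```python
import math

def sgdr_lr(step, peak_lr=3e-4, min_lr=0.0, first_cycle=1000, t_mult=2):
    """Cosine annealing with warm restarts (SGDR), illustrative sketch.

    Within each cycle the LR follows a cosine from peak_lr down to min_lr;
    at a cycle boundary it resets to peak_lr, and the next cycle is
    t_mult times longer than the previous one.
    """
    cycle_len = first_cycle
    while step >= cycle_len:       # find which cycle this step falls in
        step -= cycle_len
        cycle_len *= t_mult
    progress = step / cycle_len    # position within the current cycle, in [0, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

PyTorch ships an equivalent built-in, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, for production use.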
Modern machine learning training dynamics benefit significantly from learning rate schedules that adapt to training phases: early-stage training with a high learning rate enables rapid loss descent, while later stages with lower learning rates fine-tune the learned representations. The warmup phase typically spans 5-10% of total training steps, during which the learning rate increases linearly from 0 to the peak value; this prevents catastrophic gradient spikes that can corrupt learned representations. Following warmup, a decay schedule (linear, exponential, cosine, or polynomial) gradually reduces the learning rate. Cosine annealing with warm restarts (SGDR) periodically resets the learning rate to encourage escape from local minima while maintaining an overall descent trajectory. For language model training specifically, research shows that learning rate schedules interact strongly with batch size and weight decay: larger batch sizes call for higher peak learning rates (scaled by sqrt(batch_size)), while weight decay effectiveness increases with learning rate. Practical guidelines for transformer training: use a peak LR in the range [1e-4, 1e-3] depending on batch size, warm up for several thousand steps (roughly 5-10% of total steps), use cosine annealing with final LR = peak LR / 100, and combine with gradient clipping (norm = 1.0) for stability.
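The transformer guideline above (linear warmup, then cosine decay to a floor of peak LR / 100) can be sketched as a pure function of the step index; the defaults here are illustrative, not prescriptive:

```python
import math

def transformer_lr(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup, then cosine decay to a floor of peak_lr / 100 (sketch)."""
    min_lr = peak_lr / 100
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr; clamp progress at 1.0
    # so the LR stays at the floor if training overruns total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine
```

Wrapping this in `LambdaLR` (as in the earlier snippet) would apply it to an optimizer, with the lambda returning `transformer_lr(step) / peak_lr` as a multiplier.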
Different model architectures respond differently to learning rate schedules. Transformer models (BERT, GPT, T5) benefit from longer warmup: 10% of total steps instead of the 5% common for CNNs. Vision transformers require even longer warmup because self-attention makes early training unstable. Recurrent networks (LSTMs, GRUs) are sensitive to learning rate spikes and benefit from shorter, more conservative peak learning rates. Graph neural networks exhibit complex loss landscapes requiring careful warmup and decay, often benefiting from cyclical schedules (multiple restarts) rather than monotonic decay. Empirical observations across trainings of 100M+ parameter models suggest: cosine annealing outperforms polynomial decay by 2-3% final accuracy, warmup prevents 10-15% accuracy degradation on large models, and sqrt scaling of the learning rate with batch size improves generalization by avoiding small-batch training artifacts. The interaction between learning rate and weight decay is significant: small learning rates (1e-5) work with large weight decay (0.1), while large learning rates (1e-3) require small weight decay (0.01) to prevent oscillation. Modern practice uses adaptive schedules based on gradient statistics: if loss variance is high, reduce the learning rate; if gradient norms spike, clip or skip the update. This trend toward dynamic, metric-aware scheduling represents a frontier of optimization research, promising to reduce the hyperparameter tuning burden.
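The metric-aware behavior described (cut the LR on high loss variance, skip spiking updates) might look like the following controller. This is entirely a sketch: the class name, thresholds, and halving factor are assumptions for illustration, not a published recipe:

```python
import statistics
from collections import deque

class AdaptiveLRController:
    """Illustrative sketch: halve the LR when recent loss variance is high,
    and skip any update whose gradient norm spikes."""

    def __init__(self, lr, var_threshold=1.0, grad_norm_threshold=10.0, window=50):
        self.lr = lr
        self.var_threshold = var_threshold
        self.grad_norm_threshold = grad_norm_threshold
        self.losses = deque(maxlen=window)  # rolling window of recent losses

    def step(self, loss, grad_norm):
        """Return (lr, apply_update) for the current training step."""
        self.losses.append(loss)
        if grad_norm > self.grad_norm_threshold:
            return self.lr, False  # gradient spike: skip (or clip) this update
        if len(self.losses) == self.losses.maxlen and \
                statistics.variance(self.losses) > self.var_threshold:
            self.lr *= 0.5         # loss is noisy: halve the learning rate
            self.losses.clear()    # restart the window at the new LR
        return self.lr, True
```

In a real loop the returned LR would be written into the optimizer's parameter groups each step.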
Fine-tuning pretrained models requires different learning rate schedules than training from scratch: pretrained features are already high-quality, and large learning rates destroy learned representations. Standard practice is to use 0.1-0.001x the original pretraining peak learning rate (e.g., if pretraining used 1e-3, fine-tune with 1e-4 to 1e-5). Warmup can be shorter: 500-1000 steps suffice, versus 5000+ for training from scratch. Layer-wise learning rate decay (reducing learning rates more aggressively for earlier layers) improves fine-tuning accuracy by 1-2% across vision and NLP tasks. Discriminative fine-tuning sets learning rates for bottom layers << middle layers << top layers, reflecting that bottom layers learn general features while top layers learn task-specific patterns. Multi-stage fine-tuning first trains only the top layers with high learning rates, then gradually unfreezes and fine-tunes middle and bottom layers with progressively lower rates. Analysis of 10K+ fine-tuning experiments shows this approach reduces catastrophic forgetting (the performance drop on the original pretraining task) from 20-30% to under 5%. When integrating with learning rate schedulers, use a longer constant-LR plateau (no decay) for fine-tuning so the rate is not reduced prematurely, applying decay only in the final 10% of the fine-tuning budget.
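Layer-wise (discriminative) learning rate decay reduces to one multiplier per layer. A minimal sketch, where the decay factor and function name are illustrative choices:

```python
def layerwise_lrs(base_lr, num_layers, decay=0.9):
    """Per-layer LRs for discriminative fine-tuning (illustrative sketch).

    The top layer gets base_lr; each layer below it has its LR multiplied
    by `decay`, so bottom layers (index 0) change slowest, preserving
    general features, while top layers adapt fastest.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
```

In PyTorch these values would typically be assigned via per-parameter-group `lr` entries when constructing the optimizer.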
Learning rate schedules interact strongly with batch size, model architecture, and optimization algorithm. The sqrt scaling rule (LR ∝ sqrt(batch_size)) helps large-batch training maintain convergence: if batch size 256 uses peak_lr, batch size 1024 uses sqrt(4) × peak_lr = 2 × peak_lr. Rationale: larger batches provide better gradient estimates but a smaller per-example learning signal, so the learning rate must rise proportionally. Empirical validation across 50K+ training runs shows sqrt scaling reduces the number of optimization steps to convergence by ~1.5x despite the modest learning rate increase. Interaction with weight decay: a large learning rate (1e-3) combined with large weight decay (0.1) causes oscillation and divergence, requiring a reduction to weight decay 0.01 or a lower learning rate. Layered (per-layer differential) learning rates improve convergence: bottom layers (e.g., ResNet blocks 1-2) get 0.1x the learning rate, middle layers 0.5x, top layers 1.0x. Justification: bottom layers learn general features (edges, textures) that should change slowly to preserve knowledge, while top layers learn task-specific features requiring faster adaptation. Empirically, layered learning rates improve ImageNet accuracy by 0.5-1.0% and downstream task performance by 2-5%. Some schedules also include a temperature parameter that scales the entire LR schedule: temperature 1.0 is the baseline, values below 1.0 reduce aggressiveness (longer convergence), and values above 1.0 increase it (faster convergence at the risk of divergence).
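The sqrt scaling rule is a one-liner; the base batch size of 256 below mirrors the example above:

```python
import math

def scaled_lr(peak_lr, batch_size, base_batch_size=256):
    """Scale the peak LR by sqrt(batch_size / base_batch_size)."""
    return peak_lr * math.sqrt(batch_size / base_batch_size)
```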
Hyperparameter co-optimization reveals non-obvious interactions affecting convergence. Learning rate and weight decay interact multiplicatively: a model trained with LR = 0.001 and WD = 0.01 differs significantly from one trained with LR = 0.0001 and WD = 0.1, despite similar mathematical decay. Empirical studies show the LR-WD interaction explains 30-40% of the variance in final accuracy across 1000+ training runs. Batch size couples with learning rate: doubling the batch size allows a sqrt(2)x learning rate increase without hurting convergence. Warmup duration couples with total steps: a 10% warmup over 100K steps works, but the same 10K warmup steps spread over a 10K-step run (100% warmup) fails. Learning rate couples with model size: larger models benefit from larger learning rates due to noisier per-parameter gradients. Vision transformers require 10-50x longer warmup than CNNs due to unstable early self-attention gradients, and decoder-only models like GPT require different schedules than encoder-decoder models like T5. Automated hyperparameter tuning via Bayesian optimization, population-based training, or evolutionary algorithms can optimize schedules for a specific dataset. For practical transfer learning: start from documented hyperparameters of similar models, reduce the learning rate 5-10x for fine-tuning, and tune learning rate and batch size first, as the highest-impact parameters.
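One way to see the multiplicative LR-WD interaction: under decoupled weight decay (AdamW-style), each step multiplies the weights by (1 - lr × wd). The two pairs above then shrink weights identically per step, so the difference in training outcomes must come from the gradient step lr × g, which differs by 10x. A sketch with illustrative numbers:

```python
def per_step_decay(lr, wd):
    """Weight shrink factor per step under decoupled weight decay
    (AdamW-style update: w <- w - lr * wd * w, i.e. w *= 1 - lr * wd)."""
    return 1.0 - lr * wd

# Both pairs from the text yield the same per-step shrink (lr * wd = 1e-5),
# yet the gradient term lr * g differs 10x between them, so the two runs
# explore the loss landscape very differently.
pair_a = per_step_decay(1e-3, 0.01)
pair_b = per_step_decay(1e-4, 0.1)
```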