Empirical power-law relationships between model parameters, training data, compute budget, and downstream performance that guide the design and training of large language models at scale.
Scaling laws describe empirical power-law relationships between model size (number of parameters, N), dataset size (number of training tokens, D), compute budget (C, measured in FLOPs), and model performance (loss or downstream task accuracy). The key insight: loss decreases predictably as you scale model size, data, or compute, following a power law such as Loss ≈ k * N^(-α), where α is the scaling exponent (typically 0.05-0.1) and k is a constant (written k here to keep it distinct from the compute budget C).
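A quick sketch of what such a power law implies; the constant k is illustrative, and the exponent used here is the parameter-scaling value discussed later:

```python
def power_law_loss(n_params: float, k: float = 1.0, alpha: float = 0.076) -> float:
    """Loss as a power law in parameter count: L(N) = k * N**(-alpha)."""
    return k * n_params ** (-alpha)

# Doubling N shrinks loss by a fixed fraction, independent of starting size:
gain = 1 - power_law_loss(2e9) / power_law_loss(1e9)  # ~5% per doubling
```

Because the ratio L(2N)/L(N) = 2^(-α) does not depend on N, the per-doubling improvement is constant across scales, which is what makes extrapolation from small pilot models possible.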
Scaling laws have profound implications. They suggest that: (1) Larger models and more data consistently improve performance; no wall has been hit, at least at current scales. (2) Performance is predictable: you can forecast downstream task accuracy by measuring loss on a smaller model. (3) Compute is the bottleneck: for a fixed compute budget, there's an optimal split between model size and training data. (4) Fits extrapolate: exponents measured at small scale continue to hold at much larger scales.
The field has moved from trying to understand individual model architectures to asking: "How do the fundamental quantities (N, D, C) relate to performance?" This shift has enabled strategic planning: estimate your compute budget, use scaling laws to predict optimal N and D, and know roughly how well the model will perform before training.
In "Scaling Laws for Neural Language Models," OpenAI researchers trained models ranging from 10M to 10B parameters on datasets up to 400B tokens. They discovered three independent scaling laws:
1. Model Size (N)
Loss decreases with model size following: Loss_N = (N_0 / N)^α_N, where α_N ≈ 0.076. In practice: doubling model size improves loss by ~5-6%.
2. Dataset Size (D)
Loss decreases with dataset size following: Loss_D = (D_0 / D)^α_D, where α_D ≈ 0.095. Doubling dataset size improves loss by ~6-7%.
3. Compute (C)
Loss decreases with compute budget (measured in FLOPs) following: Loss_C = (C_0 / C)^α_C, where α_C ≈ 0.06. The relationship is weaker than N or D individually, but important for planning.
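The three exponents above translate directly into per-doubling improvements; the reference scales (N_0, D_0, C_0) cancel out of the ratio, so only the exponents matter:

```python
def doubling_gain(alpha: float) -> float:
    """Per-doubling loss reduction implied by L(x) = (x_0 / x)**alpha:
    L(2x) / L(x) = 2**(-alpha), so the fractional gain is 1 - 2**(-alpha)."""
    return 1 - 2 ** (-alpha)

gains = {
    "N (params)": doubling_gain(0.076),  # ~5.1% per doubling
    "D (tokens)": doubling_gain(0.095),  # ~6.4% per doubling
    "C (FLOPs)": doubling_gain(0.060),   # ~4.1% per doubling
}
```

These numbers match the rules of thumb quoted above: ~5-6% per doubling of parameters, ~6-7% per doubling of data, and a weaker gain per doubling of raw compute.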
Combined Scaling Law
When model size and data are varied together, loss is well described by an additive form: Loss(N, D) ≈ E + A / N^α + B / D^β, where E is the irreducible loss (vocabulary effects, task difficulty), A and B are constants, and α and β are exponents. The constants and exponents vary by task and model class.
Implications: (1) No mysterious wall—loss improves smoothly as you scale. (2) The exponents are consistent across tasks and model families (Transformers with different architectures show similar exponents). (3) You can transfer knowledge: a scaling law fit on small models predicts performance on larger models.
DeepMind's follow-up work, "Training Compute-Optimal Large Language Models," challenged Kaplan et al.'s conclusion that larger models need relatively fewer tokens. Hoffmann et al. trained over 400 models ranging from 70M to over 16B parameters, each across a range of token budgets (roughly 5B to 500B tokens), and fit their own scaling laws.
Key Finding: The 20× Rule
For compute-optimal training, model size (N) and data size (D) should scale equally: D ≈ 20 * N (measured in tokens). In other words, train each model on ~20 tokens per parameter. This is roughly 10× more data than Kaplan et al. recommended, and it yields significantly better downstream performance.
Compute-Optimal Exponents
Chinchilla's parametric fit is Loss = E + A / N^α + B / D^β, with α ≈ 0.34 and β ≈ 0.28. Solving for the compute-optimal allocation under C ≈ 6ND gives N_opt ∝ C^0.5 and D_opt ∝ C^0.5: a symmetric split. The symmetry is intuitive: model size and data size contribute roughly equally to reducing loss, so the optimal allocation balances them.
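A sketch of the parametric fit using the constants reported in the Chinchilla paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the exact values as approximate:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.
    Default constants are approximate values from the Chinchilla paper."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# At Chinchilla scale (70B params, 1.4T tokens) the predicted loss sits
# close to the irreducible term E; further gains need both more N and D.
loss_70b = chinchilla_loss(70e9, 1.4e12)
```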
Practical Impact
A 70B model trained on 1.4T tokens (Chinchilla) outperforms GPT-3 (175B parameters, 300B tokens) and Gopher (280B parameters, 300B tokens) despite using the same training compute as Gopher, and it is far cheaper at inference because it is smaller. The same insight shaped later models such as LLaMA and Mistral: smaller, data-rich models that enable efficient inference and faster iteration.
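The compute comparison can be checked with the standard C ≈ 6ND rule of thumb for dense Transformers (parameter counts and token budgets as published):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Pretraining compute estimate for dense Transformers: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

gpt3 = train_flops(175e9, 300e9)        # ~3.1e23 FLOPs
gopher = train_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs, on par with Gopher
```

Note that Chinchilla's training budget actually exceeds GPT-3's; its advantage is per-FLOP efficiency and a 2.5× smaller model to serve, not a smaller training bill.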
A surprising phenomenon: some capabilities appear suddenly at scale. Models below a certain size perform at chance; above a threshold, accuracy jumps sharply. Examples: in-context learning (few-shot prompting), chain-of-thought reasoning, and mathematical problem-solving all show emergent-looking behavior. This raises a question: are emergent abilities real, or artifacts of evaluation metrics?
The Debate: Some researchers argue emergence is real: there are threshold phenomena where increasing scale unlocks new capabilities. Others (notably Schaeffer et al., 2023) argue emergence is primarily a measurement artifact: plotted as log-loss vs. log-model-size, the curves are smooth power laws, and the appearance of sudden emergence depends on the metric (exact-match accuracy on a binary task, for example, turns a smooth underlying loss into a step function).
Evidence for Emergent Abilities: Few-shot in-context learning improves dramatically at scale. Chain-of-thought reasoning (asking the model to "think step by step") works well for larger models but not smaller ones. Solving complex math problems shows sharp improvements above 100B parameters.
Skeptical View: If you measure loss (not binary accuracy), large models and small models show consistent improvement—no sudden jump. The emergence is an artifact of thresholding: a task has a minimum loss required for success, and crossing that threshold looks sudden. But the underlying loss improves smoothly.
In practice, both views are useful. Plan for smooth scaling of loss, but be aware that specific downstream tasks (especially those with discrete success criteria) will show emergent-looking behavior at certain scales.
Step 1: Estimate Compute Budget
How many FLOPs do you have? A training run on 8 H100 GPUs for 30 days, at a peak of ~1e15 FLOP/s per GPU (dense BF16), yields at most C = 8 * 1e15 * (30 * 86,400) ≈ 2e22 FLOPs. Account for overhead (not all compute is utilized): typical utilization is 40-60% of peak, so budget closer to 1e22.
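A minimal budget calculator; the ~1e15 FLOP/s dense-BF16 per-H100 figure and 50% utilization are rough assumptions, not measured values:

```python
def compute_budget(n_gpus: int, days: float,
                   peak_flops: float = 1e15, utilization: float = 0.5) -> float:
    """Usable training FLOPs: GPUs x seconds x effective per-GPU throughput.
    peak_flops ~1e15/s is a rough dense-BF16 figure for an H100 (assumption);
    utilization (MFU) of 40-60% of peak is typical for well-tuned runs."""
    return n_gpus * days * 86_400 * peak_flops * utilization

budget = compute_budget(8, 30)  # ~1e22 FLOPs
```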
Step 2: Apply Chinchilla Formula
Given C, combine C ≈ 6ND with the 20-tokens-per-parameter rule (D ≈ 20N) to get C ≈ 120N², so N_optimal ≈ sqrt(C / 120) and D_optimal ≈ 20 * N_optimal. For C ≈ 6e23 FLOPs (roughly the Chinchilla training budget): N ≈ 70B params, D ≈ 1.4T tokens.
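The allocation step as code; the derivation assumes the C ≈ 6ND approximation and a tokens-per-parameter ratio of 20:

```python
import math

def chinchilla_allocation(c_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C ~= 6*N*D with D = r*N (r ~= 20 tokens/param):
    C = 6*r*N**2  =>  N_opt = sqrt(C / (6*r)),  D_opt = r * N_opt."""
    n_opt = math.sqrt(c_flops / (6 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

n, d = chinchilla_allocation(6e23)  # ~7e10 params, ~1.4e12 tokens
```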
Step 3: Validate Against Empirical Loss Curves
Fit scaling laws using small pilot runs (e.g., train a 1B model for a few billion tokens). Measure loss and fit the parametric form Loss ≈ E + A / N^α + B / D^β. Use the fit to predict downstream performance (MMLU, HellaSwag, etc.) for your target model size.
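A toy version of the fitting step on synthetic pilot data; the constants are invented, and the exponents are pinned so the fit stays linear (fitting the exponents as well requires nonlinear least squares):

```python
import numpy as np

# Toy pilot sweep: vary params and tokens independently so the design
# matrix has full rank (a fixed tokens-per-param ratio would make the
# N and D features collinear).
runs = [(1e8, 2e9), (1e8, 2e10), (1e9, 2e9), (1e9, 2e10), (3e9, 6e10)]

# "Ground truth" constants used only to synthesize losses here;
# in practice the losses come from your measured pilot runs.
E_true, A_true, B_true, alpha, beta = 1.7, 2.0, 2.5, 0.05, 0.05
losses = np.array([E_true + A_true / n ** alpha + B_true / d ** beta
                   for n, d in runs])

# With the exponents pinned, L = E + A*N^-alpha + B*D^-beta is linear
# in (E, A, B), so ordinary least squares recovers the constants.
X = np.array([[1.0, n ** -alpha, d ** -beta] for n, d in runs])
E_fit, A_fit, B_fit = np.linalg.lstsq(X, losses, rcond=None)[0]
```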
Step 4: Account for Task-Specific Scaling
Scaling laws vary by task. Arithmetic is slower to improve than language modeling. Few-shot learning requires larger models. Adjust your allocation based on your primary use case.
Step 5: Plan Training Duration
For N params and D tokens, effective batch size is key. Typical batch sizes run 0.5M to 4M tokens per step, and training takes D / batch_size steps: a 70B model trained on 1.4T tokens with 2M-token batches needs 700k steps. Mind the cluster size this implies: at C ≈ 6ND ≈ 6e23 FLOPs, 8 H100s would take years, while a ~2,000-GPU H100 cluster at 40-50% utilization finishes in one to two weeks.
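The duration arithmetic as a sketch; per-GPU throughput and utilization are assumptions (rough H100 figures), not guarantees:

```python
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  per_gpu_flops: float = 1e15, utilization: float = 0.45) -> float:
    """Wall-clock days to spend C ~= 6*N*D FLOPs on a given cluster.
    per_gpu_flops (~H100 dense BF16 peak) and utilization are assumptions."""
    total_flops = 6 * n_params * n_tokens
    per_day = n_gpus * per_gpu_flops * utilization * 86_400
    return total_flops / per_day

tiny_cluster = training_days(70e9, 1.4e12, 8)     # years, not weeks
big_cluster = training_days(70e9, 1.4e12, 2048)   # about a week
```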
Traditional scaling laws apply to training. But recent work on inference-time scaling—models like OpenAI's o1 and DeepSeek-R1—show that test-time compute (how much the model reasons/computes before outputting an answer) also improves performance. This opens a new frontier: instead of only scaling training, scale reasoning.
Chain-of-Thought and Reasoning: Models can use more compute at inference time by generating intermediate reasoning steps (chain-of-thought). The model doesn't just jump to an answer—it thinks through the problem. This longer generation requires more inference compute but improves accuracy, especially on hard reasoning tasks (math, coding, logic puzzles).
Inference Scaling Laws: Preliminary results suggest accuracy improves predictably with test-time compute, roughly Accuracy ≈ f(N) * g(C_test), where N is model size and C_test is inference-time compute. The implication: even modest-sized models can achieve high accuracy on reasoning tasks given enough test-time compute.
Practical Implications: For reasoning-heavy applications (coding assistance, math tutoring, research automation), allocate budget to inference compute, not just training. A smaller model with longer reasoning might outperform a larger model with direct prediction.
Planning a Training Run: (1) Estimate your compute budget (FLOPs available). (2) Use Chinchilla formula to find N and D. (3) Fit scaling laws on small-scale runs (1-8B parameters on 10-100B tokens). (4) Extrapolate to your target scale. (5) Run the full training and measure downstream task performance. (6) If performance is below target, rerun with updated D and N.
Predicting Downstream Performance: Loss is a proxy for downstream accuracy, but the mapping is nonlinear and task-dependent. Fit loss to downstream task accuracy (e.g., MMLU score) using small models, then extrapolate. Benchmark on a held-out subset of your target tasks at small scale to validate the relationship.
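One common modeling choice for that nonlinear mapping is a bounded sigmoid from loss to accuracy; every constant below is a hypothetical placeholder that would be fit to your own pilot measurements:

```python
import math

def loss_to_accuracy(loss: float, floor: float = 0.25, ceiling: float = 0.95,
                     midpoint: float = 2.4, steepness: float = 6.0) -> float:
    """Bounded sigmoid mapping from pretraining loss to benchmark accuracy:
    a chance-level floor, a saturation ceiling, and a sharp transition in
    between. All four constants are hypothetical and task-dependent."""
    frac = 1.0 / (1.0 + math.exp(steepness * (loss - midpoint)))
    return floor + (ceiling - floor) * frac
```

The shape captures why downstream metrics can look "emergent" even when loss improves smoothly: a small loss change near the midpoint moves accuracy a lot, while the same change near the floor moves it barely at all.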
Comparing Models: Use scaling laws to predict performance of models you haven't trained yet. If a competitor released a 70B model trained on 1.4T tokens, use Chinchilla formulas to estimate its performance without training it yourself.
Efficiency Targets: Define your goal: max performance for a given compute budget (Chinchilla), max performance for a given inference cost (larger models, fewer tokens), or balanced (medium models, medium tokens). Adjust N and D accordingly.
Future Directions: As compute becomes more abundant, expect focus to shift from scaling training to scaling inference (test-time compute). Multi-modal models may have different scaling laws than pure language models. And continual learning (training on new data without forgetting old data) will have its own scaling dynamics.
Scaling laws compress decades of empirical ML intuition into a handful of power-law equations. The table below summarises the key results every practitioner should know, with practical implications for model selection and training budget allocation.
| Finding | Source | Key Equation / Rule | Practical Implication |
|---|---|---|---|
| Loss scales as power law in N, D, C | Kaplan 2020 | L ∝ N^-0.076, L ∝ D^-0.095 | Kaplan's allocation favored growing N faster than D at fixed C (later revised) |
| Compute-optimal: N and D should scale equally | Chinchilla 2022 | N_opt ∝ C^0.5, D_opt ∝ C^0.5 | Most 2020–2022 models were undertrained on data |
| Inference-adjusted: over-train for cheaper serving | Llama/Mistral practice | Train 5–10× Chinchilla-optimal tokens | Smaller, overtrained models > larger undertrained at same inference cost |
| Test-time compute: more thinking → better answers | OpenAI o1, DeepSeek R1 | Accuracy ∝ log(inference FLOPs) | Chain-of-thought and search scale separately from pretraining |
The practical takeaway for most engineering teams: fit a compute budget first, then use Chinchilla to set model size and token count, then adjust toward the over-trained regime if you expect high query volume. At production scale, inference cost per token dominates pretraining cost within 3–6 months — so err on the side of smaller, better-trained models.