Empirical power-law relationships between model parameters, training data, compute budget, and downstream performance that guide the design and training of large language models at scale.
Scaling laws describe empirical power-law relationships between model size (number of parameters, N), dataset size (number of training tokens, D), compute budget (C, measured in FLOPs), and model performance (loss or downstream task accuracy). The key insight: loss decreases predictably as you scale model size, data, or compute, following a power law such as Loss ≈ k * N^(-α), where α is the scaling exponent (typically 0.05-0.1) and k is a constant (written k here to keep it distinct from the compute budget C).
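A quick sketch of what such a power law implies; the constant k is illustrative, and the exponent used here is the parameter-scaling value discussed later:

```python
def power_law_loss(n_params: float, k: float = 1.0, alpha: float = 0.076) -> float:
    """Loss as a power law in parameter count: L(N) = k * N**(-alpha)."""
    return k * n_params ** (-alpha)

# Doubling N shrinks loss by a fixed fraction, independent of starting size:
gain = 1 - power_law_loss(2e9) / power_law_loss(1e9)  # ~5% per doubling
```

Because the ratio L(2N)/L(N) = 2^(-α) does not depend on N, the per-doubling improvement is constant across scales, which is what makes extrapolation from small pilot models possible.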
Scaling laws have profound implications. They suggest that: (1) Larger models and more data consistently improve performance; no wall has been hit, at least at current scales. (2) Performance is predictable: you can forecast downstream task accuracy by measuring loss on a smaller model. (3) Compute is the bottleneck: for a fixed compute budget, there's an optimal split between model size and training data. (4) Fits extrapolate: exponents measured at small scale continue to hold at much larger scales.
The field has moved from trying to understand individual model architectures to asking: "How do the fundamental quantities (N, D, C) relate to performance?" This shift has enabled strategic planning: estimate your compute budget, use scaling laws to predict optimal N and D, and know roughly how well the model will perform before training.
In "Scaling Laws for Neural Language Models," OpenAI researchers trained models ranging from 10M to 10B parameters on datasets up to 400B tokens. They discovered three independent scaling laws:
1. Model Size (N)
Loss decreases with model size following: Loss_N = (N_0 / N)^α_N, where α_N ≈ 0.076. In practice: doubling model size improves loss by ~5-6%.
2. Dataset Size (D)
Loss decreases with dataset size following: Loss_D = (D_0 / D)^α_D, where α_D ≈ 0.095. Doubling dataset size improves loss by ~6-7%.
3. Compute (C)
Loss decreases with compute budget (measured in FLOPs) following: Loss_C = (C_0 / C)^α_C, where α_C ≈ 0.06. The relationship is weaker than N or D individually, but important for planning.
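The three exponents above translate directly into per-doubling improvements; the reference scales (N_0, D_0, C_0) cancel out of the ratio, so only the exponents matter:

```python
def doubling_gain(alpha: float) -> float:
    """Per-doubling loss reduction implied by L(x) = (x_0 / x)**alpha:
    L(2x) / L(x) = 2**(-alpha), so the fractional gain is 1 - 2**(-alpha)."""
    return 1 - 2 ** (-alpha)

gains = {
    "N (params)": doubling_gain(0.076),  # ~5.1% per doubling
    "D (tokens)": doubling_gain(0.095),  # ~6.4% per doubling
    "C (FLOPs)": doubling_gain(0.060),   # ~4.1% per doubling
}
```

These numbers match the rules of thumb quoted above: ~5-6% per doubling of parameters, ~6-7% per doubling of data, and a weaker gain per doubling of raw compute.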
Combined Scaling Law
When model size and data are varied together, loss is well described by an additive form: Loss(N, D) ≈ E + A / N^α + B / D^β, where E is the irreducible loss (vocabulary effects, task difficulty), A and B are constants, and α and β are exponents. The constants and exponents vary by task and model class.
Implications: (1) No mysterious wall—loss improves smoothly as you scale. (2) The exponents are consistent across tasks and model families (Transformers with different architectures show similar exponents). (3) You can transfer knowledge: a scaling law fit on small models predicts performance on larger models.
DeepMind's follow-up work, "Training Compute-Optimal Large Language Models," challenged Kaplan et al.'s conclusion that larger models need relatively fewer tokens. Hoffmann et al. trained over 400 models ranging from 70M to over 16B parameters, each across a range of token budgets (roughly 5B to 500B tokens), and fit their own scaling laws.
Key Finding: The 20× Rule
For compute-optimal training, model size (N) and data size (D) should scale equally: D ≈ 20 * N (measured in tokens). In other words, train each model on ~20 tokens per parameter. This is roughly 10× more data than Kaplan et al. recommended, and it yields significantly better downstream performance.
Compute-Optimal Exponents
Chinchilla's parametric fit is Loss = E + A / N^α + B / D^β, with α ≈ 0.34 and β ≈ 0.28. Solving for the compute-optimal allocation under C ≈ 6ND gives N_opt ∝ C^0.5 and D_opt ∝ C^0.5: a symmetric split. The symmetry is intuitive: model size and data size contribute roughly equally to reducing loss, so the optimal allocation balances them.
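A sketch of the parametric fit using the constants reported in the Chinchilla paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the exact values as approximate:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.
    Default constants are approximate values from the Chinchilla paper."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# At Chinchilla scale (70B params, 1.4T tokens) the predicted loss sits
# close to the irreducible term E; further gains need both more N and D.
loss_70b = chinchilla_loss(70e9, 1.4e12)
```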
Practical Impact
A 70B model trained on 1.4T tokens (Chinchilla) outperforms GPT-3 (175B parameters, 300B tokens) and Gopher (280B parameters, 300B tokens) despite using the same training compute as Gopher, and it is far cheaper at inference because it is smaller. The same insight shaped later models such as LLaMA and Mistral: smaller, data-rich models that enable efficient inference and faster iteration.
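The compute comparison can be checked with the standard C ≈ 6ND rule of thumb for dense Transformers (parameter counts and token budgets as published):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Pretraining compute estimate for dense Transformers: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

gpt3 = train_flops(175e9, 300e9)        # ~3.1e23 FLOPs
gopher = train_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs, on par with Gopher
```

Note that Chinchilla's training budget actually exceeds GPT-3's; its advantage is per-FLOP efficiency and a 2.5× smaller model to serve, not a smaller training bill.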
A surprising phenomenon: some capabilities appear suddenly at scale. Models below a certain size perform at chance; above a threshold, accuracy jumps sharply. Examples: in-context learning (few-shot prompting), chain-of-thought reasoning, and mathematical problem-solving all show emergent-looking behavior. This raises a question: are emergent abilities real, or artifacts of evaluation metrics?
The Debate: Some researchers argue emergence is real: there are threshold phenomena where increasing scale unlocks new capabilities. Others (notably Schaeffer et al., 2023) argue emergence is primarily a measurement artifact: plotted as log-loss vs. log-model-size, the curves are smooth power laws, and the appearance of sudden emergence depends on the metric (exact-match accuracy on a binary task, for example, turns a smooth underlying loss into a step function).
Evidence for Emergent Abilities: Few-shot in-context learning improves dramatically at scale. Chain-of-thought reasoning (asking the model to "think step by step") works well for larger models but not smaller ones. Solving complex math problems shows sharp improvements above 100B parameters.
Skeptical View: If you measure loss (not binary accuracy), large models and small models show consistent improvement—no sudden jump. The emergence is an artifact of thresholding: a task has a minimum loss required for success, and crossing that threshold looks sudden. But the underlying loss improves smoothly.
In practice, both views are useful. Plan for smooth scaling of loss, but be aware that specific downstream tasks (especially those with discrete success criteria) will show emergent-looking behavior at certain scales.
Step 1: Estimate Compute Budget
How many FLOPs do you have? A training run on 8 H100 GPUs for 30 days, at a peak of ~1e15 FLOP/s per GPU (dense BF16), yields at most C = 8 * 1e15 * (30 * 86,400) ≈ 2e22 FLOPs. Account for overhead (not all compute is utilized): typical utilization is 40-60% of peak, so budget closer to 1e22.
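A minimal budget calculator; the ~1e15 FLOP/s dense-BF16 per-H100 figure and 50% utilization are rough assumptions, not measured values:

```python
def compute_budget(n_gpus: int, days: float,
                   peak_flops: float = 1e15, utilization: float = 0.5) -> float:
    """Usable training FLOPs: GPUs x seconds x effective per-GPU throughput.
    peak_flops ~1e15/s is a rough dense-BF16 figure for an H100 (assumption);
    utilization (MFU) of 40-60% of peak is typical for well-tuned runs."""
    return n_gpus * days * 86_400 * peak_flops * utilization

budget = compute_budget(8, 30)  # ~1e22 FLOPs
```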
Step 2: Apply Chinchilla Formula
Given C, combine C ≈ 6ND with the 20-tokens-per-parameter rule (D ≈ 20N) to get C ≈ 120N², so N_optimal ≈ sqrt(C / 120) and D_optimal ≈ 20 * N_optimal. For C ≈ 6e23 FLOPs (roughly the Chinchilla training budget): N ≈ 70B params, D ≈ 1.4T tokens.
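The allocation step as code; the derivation assumes the C ≈ 6ND approximation and a tokens-per-parameter ratio of 20:

```python
import math

def chinchilla_allocation(c_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C ~= 6*N*D with D = r*N (r ~= 20 tokens/param):
    C = 6*r*N**2  =>  N_opt = sqrt(C / (6*r)),  D_opt = r * N_opt."""
    n_opt = math.sqrt(c_flops / (6 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

n, d = chinchilla_allocation(6e23)  # ~7e10 params, ~1.4e12 tokens
```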
Step 3: Validate Against Empirical Loss Curves
Fit scaling laws using small pilot runs (e.g., train a 1B model for a few billion tokens). Measure loss and fit the parametric form Loss ≈ E + A / N^α + B / D^β. Use the fit to predict downstream performance (MMLU, HellaSwag, etc.) for your target model size.
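A toy version of the fitting step on synthetic pilot data; the constants are invented, and the exponents are pinned so the fit stays linear (fitting the exponents as well requires nonlinear least squares):

```python
import numpy as np

# Toy pilot sweep: vary params and tokens independently so the design
# matrix has full rank (a fixed tokens-per-param ratio would make the
# N and D features collinear).
runs = [(1e8, 2e9), (1e8, 2e10), (1e9, 2e9), (1e9, 2e10), (3e9, 6e10)]

# "Ground truth" constants used only to synthesize losses here;
# in practice the losses come from your measured pilot runs.
E_true, A_true, B_true, alpha, beta = 1.7, 2.0, 2.5, 0.05, 0.05
losses = np.array([E_true + A_true / n ** alpha + B_true / d ** beta
                   for n, d in runs])

# With the exponents pinned, L = E + A*N^-alpha + B*D^-beta is linear
# in (E, A, B), so ordinary least squares recovers the constants.
X = np.array([[1.0, n ** -alpha, d ** -beta] for n, d in runs])
E_fit, A_fit, B_fit = np.linalg.lstsq(X, losses, rcond=None)[0]
```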
Step 4: Account for Task-Specific Scaling
Scaling laws vary by task. Arithmetic is slower to improve than language modeling. Few-shot learning requires larger models. Adjust your allocation based on your primary use case.
Step 5: Plan Training Duration
For N params and D tokens, effective batch size is key. Typical batch sizes run 0.5M to 4M tokens per step, and training takes D / batch_size steps: a 70B model trained on 1.4T tokens with 2M-token batches needs 700k steps. Mind the cluster size this implies: at C ≈ 6ND ≈ 6e23 FLOPs, 8 H100s would take years, while a ~2,000-GPU H100 cluster at 40-50% utilization finishes in one to two weeks.
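The duration arithmetic as a sketch; per-GPU throughput and utilization are assumptions (rough H100 figures), not guarantees:

```python
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  per_gpu_flops: float = 1e15, utilization: float = 0.45) -> float:
    """Wall-clock days to spend C ~= 6*N*D FLOPs on a given cluster.
    per_gpu_flops (~H100 dense BF16 peak) and utilization are assumptions."""
    total_flops = 6 * n_params * n_tokens
    per_day = n_gpus * per_gpu_flops * utilization * 86_400
    return total_flops / per_day

tiny_cluster = training_days(70e9, 1.4e12, 8)     # years, not weeks
big_cluster = training_days(70e9, 1.4e12, 2048)   # about a week
```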
Traditional scaling laws apply to training. But recent work on inference-time scaling—models like OpenAI's o1 and DeepSeek-R1—show that test-time compute (how much the model reasons/computes before outputting an answer) also improves performance. This opens a new frontier: instead of only scaling training, scale reasoning.
Chain-of-Thought and Reasoning: Models can use more compute at inference time by generating intermediate reasoning steps (chain-of-thought). The model doesn't just jump to an answer—it thinks through the problem. This longer generation requires more inference compute but improves accuracy, especially on hard reasoning tasks (math, coding, logic puzzles).
Inference Scaling Laws: Preliminary results suggest accuracy improves predictably with test-time compute, roughly Accuracy ≈ f(N) * g(C_test), where N is model size and C_test is inference-time compute. The implication: even modest-sized models can achieve high accuracy on reasoning tasks given enough test-time compute.
Practical Implications: For reasoning-heavy applications (coding assistance, math tutoring, research automation), allocate budget to inference compute, not just training. A smaller model with longer reasoning might outperform a larger model with direct prediction.
Planning a Training Run: (1) Estimate your compute budget (FLOPs available). (2) Use Chinchilla formula to find N and D. (3) Fit scaling laws on small-scale runs (1-8B parameters on 10-100B tokens). (4) Extrapolate to your target scale. (5) Run the full training and measure downstream task performance. (6) If performance is below target, rerun with updated D and N.
Predicting Downstream Performance: Loss is a proxy for downstream accuracy, but the mapping is nonlinear and task-dependent. Fit loss to downstream task accuracy (e.g., MMLU score) using small models, then extrapolate. Benchmark on a held-out subset of your target tasks at small scale to validate the relationship.
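One common modeling choice for that nonlinear mapping is a bounded sigmoid from loss to accuracy; every constant below is a hypothetical placeholder that would be fit to your own pilot measurements:

```python
import math

def loss_to_accuracy(loss: float, floor: float = 0.25, ceiling: float = 0.95,
                     midpoint: float = 2.4, steepness: float = 6.0) -> float:
    """Bounded sigmoid mapping from pretraining loss to benchmark accuracy:
    a chance-level floor, a saturation ceiling, and a sharp transition in
    between. All four constants are hypothetical and task-dependent."""
    frac = 1.0 / (1.0 + math.exp(steepness * (loss - midpoint)))
    return floor + (ceiling - floor) * frac
```

The shape captures why downstream metrics can look "emergent" even when loss improves smoothly: a small loss change near the midpoint moves accuracy a lot, while the same change near the floor moves it barely at all.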
Comparing Models: Use scaling laws to predict performance of models you haven't trained yet. If a competitor released a 70B model trained on 1.4T tokens, use Chinchilla formulas to estimate its performance without training it yourself.
Efficiency Targets: Define your goal: max performance for a given compute budget (Chinchilla), max performance for a given inference cost (larger models, fewer tokens), or balanced (medium models, medium tokens). Adjust N and D accordingly.
Future Directions: As compute becomes more abundant, expect focus to shift from scaling training to scaling inference (test-time compute). Multi-modal models may have different scaling laws than pure language models. And continual learning (training on new data without forgetting old data) will have its own scaling dynamics.
Scaling laws compress decades of empirical ML intuition into a handful of power-law equations. The table below summarises the key results every practitioner should know, with practical implications for model selection and training budget allocation.
| Finding | Source | Key Equation / Rule | Practical Implication |
|---|---|---|---|
| Loss scales as power law in N, D, C | Kaplan 2020 | L ∝ N^-0.076, L ∝ D^-0.095 | Kaplan's allocation favored growing N faster than D at fixed C (later revised) |
| Compute-optimal: N and D should scale equally | Chinchilla 2022 | N_opt ∝ C^0.5, D_opt ∝ C^0.5 | Most 2020–2022 models were undertrained on data |
| Inference-adjusted: over-train for cheaper serving | Llama/Mistral practice | Train 5–10× Chinchilla-optimal tokens | Smaller, overtrained models > larger undertrained at same inference cost |
| Test-time compute: more thinking → better answers | OpenAI o1, DeepSeek R1 | Accuracy ∝ log(inference FLOPs) | Chain-of-thought and search scale separately from pretraining |
The practical takeaway for most engineering teams: fit a compute budget first, then use Chinchilla to set model size and token count, then adjust toward the over-trained regime if you expect high query volume. At production scale, inference cost per token dominates pretraining cost within 3–6 months — so err on the side of smaller, better-trained models.