01 — Foundation
Why Math Matters for LLMs
Deep learning is applied mathematics. Transformers use linear algebra (matrix multiplication for attention), calculus (backpropagation via chain rule), and probability (softmax, sampling). Understanding the math helps you debug, optimize, and innovate.
Math in the LLM Pipeline
- Linear algebra: Matrix operations in attention (Q × K^T), embeddings as high-dim vectors
- Calculus: Gradients in backprop, loss function optimization
- Probability: Cross-entropy loss, softmax sampling, temperature scaling
- Statistics: Confidence intervals for evaluation, statistical significance testing
💡 Good news: You don't need PhD-level math. Solid undergrad math plus NumPy intuition gets you far.
02 — Vectors & Matrices
Linear Algebra Essentials
Vectors are lists of numbers (embeddings). Matrices are 2D grids (weights, attention scores). Key operations: dot product (similarity), matrix multiply (transformations), eigenvalues (importance).
Core Concepts
# Vectors as embeddings
import numpy as np
v1 = np.array([1, 0, 0]) # embedding 1
v2 = np.array([1, 1, 0]) # embedding 2
# Dot product = similarity
similarity = np.dot(v1, v2) # 1
print(f"Similarity: {similarity}")
# Matrix multiply = attention scoring
Q = np.array([[1, 0], [0, 1]]) # Query
K = np.array([[1, 0], [0, 1]]) # Key
scores = Q @ K.T # attention scores
print(f"Attention:\n{scores}")
# SVD = decomposition (used in dimensionality reduction)
U, s, Vt = np.linalg.svd(scores)
print(f"Singular values: {s}")
Why It Matters for LLMs
- Embeddings: Text as vectors in embedding space (typically hundreds to thousands of dimensions, e.g., 768 in BERT, 12,288 in GPT-3)
- Attention: Similarity (dot product) between tokens determines focus
- Transformations: Feed-forward layers are matrix multiplies (learned transformations)
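The pieces above combine into scaled dot-product attention: softmax(Q K^T / √d) · V. Here is a minimal NumPy sketch; the function name and toy matrices are illustrative, not from a particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head, no masking."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Each output row is a convex combination of the rows of V, weighted by how strongly the corresponding query attends to each key.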
03 — Derivatives
Calculus for Optimization
Training is optimization: find weights that minimize loss. Derivatives (gradients) point downhill. Chain rule lets us backprop through layers to compute gradients.
Gradient Descent Loop
# Simple gradient descent: minimize loss(w) = (w - 3)^2
learning_rate = 0.1
w = 0.0
for epoch in range(100):
    # Forward pass: compute loss
    loss = (w - 3) ** 2
    # Backward pass: compute gradient (dloss/dw via chain rule)
    gradient = 2 * (w - 3)
    # Update weight (move downhill)
    w -= learning_rate * gradient
print(f"Final loss: {loss:.6f}")  # w converges toward 3, loss toward 0
Chain Rule in Backprop
If loss = f(g(h(x))), then ∂loss/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x. This is how gradients flow backward through layers.
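A quick numeric sanity check of the chain rule, using made-up component functions f, g, h (not from the original) and comparing the analytic gradient against a finite-difference estimate:

```python
import math

def h(x): return 2 * x            # innermost layer
def g(u): return math.sin(u)      # middle layer
def f(v): return v ** 2           # outermost layer (the "loss")

def loss(x): return f(g(h(x)))

def grad(x):
    # Chain rule: dloss/dx = f'(g(h(x))) * g'(h(x)) * h'(x)
    return 2 * g(h(x)) * math.cos(h(x)) * 2

x = 0.3
analytic = grad(x)
eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)
print(analytic, numeric)
```

The two values agree to several decimal places, which is exactly the check autograd frameworks use in their gradient tests.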
04 — Distributions
Probability & Information Theory
Softmax converts logits to probabilities. Cross-entropy is the standard training loss. KL divergence measures how one distribution differs from another. Entropy measures uncertainty.
# Softmax: logits → probabilities
import numpy as np
from scipy.special import softmax
logits = np.array([1.0, 2.0, 0.5])
probs = softmax(logits) # ≈ [0.23, 0.63, 0.14]
print(f"Probabilities: {probs}")
# Cross-entropy loss
true_probs = np.array([0, 1, 0])
loss = -np.sum(true_probs * np.log(probs))
print(f"Loss: {loss}")
# KL divergence: how much P differs from Q
# (skip zero-probability terms to avoid 0 * log(0) = nan)
mask = true_probs > 0
kl = np.sum(true_probs[mask] * np.log(true_probs[mask] / probs[mask]))
print(f"KL divergence: {kl}") # equals cross-entropy here: a one-hot P has zero entropy
Key Quantities
- Entropy: H(p) = -Σ p(x) log p(x) — uncertainty in distribution
- Cross-entropy: -Σ true(x) log pred(x) — loss function (matches true labels)
- KL divergence: D(P||Q) = Σ P(x) log(P/Q) — divergence between distributions (asymmetric, not a true distance)
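Temperature scaling, mentioned in section 01, follows directly from softmax: divide the logits by a temperature T before normalizing. A sketch with illustrative logits, showing that higher T flattens the distribution (higher entropy) while leaving the argmax unchanged:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution; T > 1 flattens it.
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([1.0, 2.0, 0.5])
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    entropy = -np.sum(probs * np.log(probs))
    print(f"T={T}: probs={np.round(probs, 3)}, entropy={entropy:.3f}")
```

This is the knob LLM sampling exposes: low temperature makes greedy-ish, repetitive output; high temperature makes diverse, riskier output.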
05 — Evaluation
Statistics for Model Evaluation
Mean, variance, confidence intervals, hypothesis testing. Used to assess model performance reliably and compare models statistically.
Common Statistics
- Mean/variance: Average performance ± spread
- Confidence intervals: Range where true metric likely lies (e.g., 95% CI)
- Bootstrapping: Resample data to estimate confidence without assumptions
- A/B testing: T-test for statistical significance between models
# Confidence interval via bootstrapping
import numpy as np
scores = np.array([0.85, 0.88, 0.82, 0.90, 0.87])
bootstrap_means = []
for _ in range(1000):
    sample = np.random.choice(scores, size=len(scores), replace=True)
    bootstrap_means.append(np.mean(sample))
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
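The A/B testing bullet can be sketched with SciPy's independent-samples t-test. Both score arrays below are hypothetical evaluation runs, not real benchmark numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracy scores for two models
model_a = np.array([0.85, 0.88, 0.82, 0.90, 0.87])
model_b = np.array([0.80, 0.83, 0.79, 0.85, 0.82])

# Two-sided independent-samples t-test on the mean difference
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With only five runs per model the test has little power, so in practice you would also report effect size and run more seeds; a p-value below 0.05 is the conventional (if crude) significance threshold.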
06 — Quick Reference
Key Identities
| Name | Formula | Where Used |
| --- | --- | --- |
| Dot product | a · b = Σ a_i · b_i | Attention similarity |
| Softmax | softmax(x)_i = e^(x_i) / Σ_j e^(x_j) | Probability normalization |
| Cross-entropy | -Σ y_i log(ŷ_i) | Classification loss |
| KL divergence | Σ P(x) log(P(x)/Q(x)) | Distribution distance |
| Chain rule | ∂y/∂x = (∂y/∂u) · (∂u/∂x) | Backpropagation |