Foundations · Mathematics

Math Foundations for ML

Linear algebra, calculus, probability, and statistics — the mathematical toolkit behind transformers and training

4 areas · 7 sections · NumPy-first examples
Contents
  1. Why math matters
  2. Linear algebra
  3. Calculus
  4. Probability
  5. Statistics
  6. Key identities
  7. References
01 — Foundation

Why Math Matters for LLMs

Deep learning is applied mathematics. Transformers use linear algebra (matrix multiplication for attention), calculus (backpropagation via chain rule), and probability (softmax, sampling). Understanding the math helps you debug, optimize, and innovate.

Math in the LLM Pipeline

💡 Good news: You don't need PhD-level math. Solid undergrad math + NumPy intuition gets you far.
02 — Vectors & Matrices

Linear Algebra Essentials

Vectors are lists of numbers (token embeddings are vectors). Matrices are 2D grids of numbers (weight matrices, attention scores). Key operations: the dot product (measures similarity), matrix multiplication (applies linear transformations), and eigenvalue/singular value decomposition (reveals a matrix's most important directions).

Core Concepts

# Vectors as embeddings
import numpy as np

v1 = np.array([1, 0, 0])  # embedding 1
v2 = np.array([1, 1, 0])  # embedding 2

# Dot product = similarity
similarity = np.dot(v1, v2)  # 1
print(f"Similarity: {similarity}")

# Matrix multiply = attention scoring
Q = np.array([[1, 0], [0, 1]])  # Query
K = np.array([[1, 0], [0, 1]])  # Key
T = Q @ K.T  # attention scores
print(f"Attention:\n{T}")

# SVD = decomposition (used in dimensionality reduction)
U, s, Vt = np.linalg.svd(T)
print(f"Singular values: {s}")
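The eigenvalue operation mentioned above can be sketched the same way; the 2×2 matrix here is made up purely for illustration:

```python
# Eigendecomposition: finding a matrix's dominant directions
# (the symmetric 2x2 matrix is an illustrative assumption)
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, so eigenvalues are real

eigenvalues, eigenvectors = np.linalg.eigh(A)  # ascending eigenvalues
print(f"Eigenvalues: {eigenvalues}")           # [1. 3.]

# Check the defining property: A v = lambda v
v = eigenvectors[:, 1]
assert np.allclose(A @ v, eigenvalues[1] * v)
```

The largest eigenvalue's eigenvector is the direction the matrix stretches the most, which is the intuition behind PCA-style dimensionality reduction.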


03 — Derivatives

Calculus for Optimization

Training is optimization: find the weights that minimize the loss. The gradient points uphill, so gradient descent steps in the opposite (negative-gradient) direction, downhill. The chain rule lets backpropagation compute these gradients layer by layer.

Gradient Descent Loop

# Simple gradient descent (pseudocode: compute_loss and compute_gradients
# stand in for a real model's forward and backward pass)
learning_rate = 0.01
for epoch in range(100):
    # Forward pass: compute loss
    loss = compute_loss(weights, X, y)
    # Backward pass: compute gradients (using chain rule)
    gradients = compute_gradients(loss)
    # Update weights (move downhill)
    weights -= learning_rate * gradients
print(f"Final loss: {loss}")
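A concrete, runnable version of this loop fits a one-weight linear model on toy data; the data and hyperparameters are illustrative assumptions, not a recipe:

```python
# Gradient descent on 1-D linear regression: fit y ≈ w * x by
# minimizing mean squared error (toy data made up for illustration)
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X            # true weight is 2.0
w = 0.0                # initial weight
learning_rate = 0.01

for epoch in range(200):
    loss = np.mean((w * X - y) ** 2)          # forward pass
    gradient = np.mean(2 * (w * X - y) * X)   # d(loss)/dw, derived by hand
    w -= learning_rate * gradient             # step downhill

print(f"Learned weight: {w:.3f}")  # converges to ~2.0
```

Here the gradient is derived analytically; frameworks like PyTorch compute the same quantity automatically via autograd.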

Chain Rule in Backprop

If loss = f(g(h(x))), then ∂loss/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x. This is how gradients flow backward through layers.
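The product of partial derivatives can be checked numerically on a small made-up composition (the three functions are chosen only for illustration):

```python
# Chain rule on loss = f(g(h(x))) with h(x)=x^2, g(u)=u+1, f(v)=v^3
x = 2.0
h, dh = x ** 2, 2 * x          # h = 4,   dh/dx = 4
g, dg = h + 1, 1.0             # g = 5,   dg/dh = 1
f, df = g ** 3, 3 * g ** 2     # f = 125, df/dg = 75

grad = df * dg * dh            # chain rule: 75 * 1 * 4 = 300
print(f"Analytic gradient: {grad}")

# Sanity check with a central finite difference
def loss(x):
    return ((x ** 2) + 1) ** 3

eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)
print(f"Numeric gradient:  {numeric:.3f}")
```

Backpropagation is exactly this multiplication of local derivatives, applied layer by layer at scale.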

04 — Distributions

Probability & Information Theory

Softmax converts logits to probabilities. Cross-entropy is the loss function. KL divergence measures distribution difference. Entropy measures uncertainty.

# Softmax: logits → probabilities
import numpy as np
from scipy.special import softmax

logits = np.array([1.0, 2.0, 0.5])
probs = softmax(logits)  # [0.23, 0.63, 0.14]
print(f"Probabilities: {probs}")

# Cross-entropy loss (one-hot target)
true_probs = np.array([0, 1, 0])
loss = -np.sum(true_probs * np.log(probs))
print(f"Loss: {loss}")

# KL divergence: how much P differs from Q
# (sum only where P > 0, since 0 · log 0 is taken as 0)
mask = true_probs > 0
kl = np.sum(true_probs[mask] * np.log(true_probs[mask] / probs[mask]))
print(f"KL divergence: {kl}")
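Entropy is mentioned above but not shown in code; a minimal sketch, with example distributions made up for illustration:

```python
# Entropy: high when a distribution is uncertain, zero when it is certain
import numpy as np

def entropy(p):
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked = np.array([0.97, 0.01, 0.01, 0.01])

print(f"Uniform: {entropy(uniform):.3f}")  # log(4) ≈ 1.386, max uncertainty
print(f"Peaked:  {entropy(peaked):.3f}")   # much lower: nearly certain
```

In LLM terms, a high-entropy next-token distribution means the model is unsure; temperature scaling directly trades entropy against determinism when sampling.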


05 — Evaluation

Statistics for Model Evaluation

Mean, variance, confidence intervals, and hypothesis tests are the tools for assessing model performance reliably and for comparing models statistically rather than by a single point score.

Common Statistics

# Confidence interval via bootstrapping
import numpy as np

scores = np.array([0.85, 0.88, 0.82, 0.90, 0.87])
bootstrap_means = []
for _ in range(1000):
    sample = np.random.choice(scores, size=len(scores), replace=True)
    bootstrap_means.append(np.mean(sample))
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
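Comparing two models statistically, as mentioned above, can be sketched with a paired permutation test; the scores below are made up for illustration:

```python
# Paired permutation test: is model A really better than model B,
# given their scores on the same test sets? (scores are illustrative)
import numpy as np

rng = np.random.default_rng(0)
model_a = np.array([0.85, 0.88, 0.82, 0.90, 0.87, 0.84, 0.89, 0.86])
model_b = np.array([0.83, 0.85, 0.81, 0.88, 0.84, 0.83, 0.86, 0.85])

diffs = model_a - model_b
observed = np.mean(diffs)

# Under the null hypothesis, the sign of each paired difference is random
n_perm = 10_000
count = 0
for _ in range(n_perm):
    signs = rng.choice([-1, 1], size=len(diffs))
    if abs(np.mean(signs * diffs)) >= abs(observed):
        count += 1
p_value = count / n_perm
print(f"Mean difference: {observed:.3f}, p-value: {p_value:.4f}")
```

A small p-value here suggests the gap is unlikely to be noise; with only a handful of test sets, point-score comparisons alone can easily mislead.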
06 — Quick Reference

Key Identities

Name            Formula                                   Where Used
Dot product     a · b = Σ_i a_i b_i                       Attention similarity
Softmax         softmax(x)_i = e^{x_i} / Σ_j e^{x_j}      Probability normalization
Cross-entropy   H(y, ŷ) = -Σ_i y_i log ŷ_i                Classification loss
KL divergence   D_KL(P ∥ Q) = Σ_x P(x) log(P(x)/Q(x))     Distribution distance
Chain rule      ∂y/∂x = (∂y/∂u) · (∂u/∂x)                 Backpropagation
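In practice the softmax identity is usually computed with a max-subtraction trick for numerical stability; a minimal sketch:

```python
# Numerically stable softmax: subtracting max(x) leaves the identity's
# value unchanged (e^(x-c) / Σ e^(x-c) = e^x / Σ e^x) but avoids overflow
import numpy as np

def softmax(x):
    z = x - np.max(x)           # shift so the largest logit is 0
    e = np.exp(z)
    return e / np.sum(e)

big_logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(big_logits))      # finite probabilities, no overflow
# A naive np.exp(1000.0) would overflow to inf
```

Because softmax is shift-invariant, this returns the same probabilities as the textbook formula wherever that formula doesn't overflow.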
Tools & Resources

Learning Resources

NumPy (library): numerical computing with arrays and linear algebra
SciPy (library): scientific computing with optimization, statistics, and linear algebra
SymPy (library): symbolic math for derivatives and algebraic manipulation
PyTorch (framework): automatic differentiation (autograd) for backpropagation
Khan Academy (education): free math fundamentals (linear algebra, calculus)
3Blue1Brown (education): visual explanations of linear algebra and neural networks
07 — Further Reading

References
