01 — Foundation
Why Math Matters for LLMs
Deep learning is applied mathematics. Transformers use linear algebra (matrix multiplication for attention), calculus (backpropagation via chain rule), and probability (softmax, sampling). Understanding the math helps you debug, optimize, and innovate.
Math in the LLM Pipeline
- Linear algebra: Matrix operations in attention (Q × K^T), embeddings as high-dim vectors
- Calculus: Gradients in backprop, loss function optimization
- Probability: Cross-entropy loss, softmax sampling, temperature scaling
- Statistics: Confidence intervals for evaluation, statistical significance testing
💡 Good news: You don't need PhD-level math. Solid undergrad math plus NumPy intuition gets you far.
02 — Vectors & Matrices
Linear Algebra Essentials
Vectors are lists of numbers (embeddings). Matrices are 2D grids (weights, attention scores). Key operations: dot product (similarity), matrix multiply (transformations), eigenvalues (importance).
Core Concepts
# Vectors as embeddings
import numpy as np
v1 = np.array([1, 0, 0]) # embedding 1
v2 = np.array([1, 1, 0]) # embedding 2
# Dot product = similarity
similarity = np.dot(v1, v2) # 1
print(f"Similarity: {similarity}")
# Matrix multiply = attention scoring
Q = np.array([[1, 0], [0, 1]]) # Query
K = np.array([[1, 0], [0, 1]]) # Key
scores = Q @ K.T # attention scores
print(f"Attention:\n{scores}")
# SVD = decomposition (used in dimensionality reduction)
U, s, Vt = np.linalg.svd(scores)
print(f"Singular values: {s}")
Why It Matters for LLMs
- Embeddings: Text as vectors in embedding space (typically hundreds to thousands of dimensions, e.g., 768 in BERT, 12,288 in GPT-3)
- Attention: Similarity (dot product) between tokens determines focus
- Transformations: Feed-forward layers are matrix multiplies (learned transformations)
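The pieces above combine into scaled dot-product attention: softmax(Q K^T / √d) · V. Here is a minimal NumPy sketch; the function name and toy matrices are illustrative, not from a particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head, no masking."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Each output row is a convex combination of the rows of V, weighted by how strongly the corresponding query attends to each key.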
03 — Derivatives
Calculus for Optimization
Training is optimization: find weights that minimize loss. Derivatives (gradients) point downhill. Chain rule lets us backprop through layers to compute gradients.
Gradient Descent Loop
# Simple gradient descent: minimize loss(w) = (w - 3)^2
learning_rate = 0.1
w = 0.0
for epoch in range(100):
    # Forward pass: compute loss
    loss = (w - 3) ** 2
    # Backward pass: compute gradient (dloss/dw via chain rule)
    gradient = 2 * (w - 3)
    # Update weight (move downhill)
    w -= learning_rate * gradient
print(f"Final loss: {loss:.6f}")  # w converges toward 3, loss toward 0
Chain Rule in Backprop
If loss = f(g(h(x))), then ∂loss/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x. This is how gradients flow backward through layers.
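A quick numeric sanity check of the chain rule, using made-up component functions f, g, h (not from the original) and comparing the analytic gradient against a finite-difference estimate:

```python
import math

def h(x): return 2 * x            # innermost layer
def g(u): return math.sin(u)      # middle layer
def f(v): return v ** 2           # outermost layer (the "loss")

def loss(x): return f(g(h(x)))

def grad(x):
    # Chain rule: dloss/dx = f'(g(h(x))) * g'(h(x)) * h'(x)
    return 2 * g(h(x)) * math.cos(h(x)) * 2

x = 0.3
analytic = grad(x)
eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)
print(analytic, numeric)
```

The two values agree to several decimal places, which is exactly the check autograd frameworks use in their gradient tests.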
04 — Distributions
Probability & Information Theory
Softmax converts logits to probabilities. Cross-entropy is the standard training loss. KL divergence measures how one distribution differs from another. Entropy measures uncertainty.
# Softmax: logits → probabilities
import numpy as np
from scipy.special import softmax
logits = np.array([1.0, 2.0, 0.5])
probs = softmax(logits) # ≈ [0.23, 0.63, 0.14]
print(f"Probabilities: {probs}")
# Cross-entropy loss
true_probs = np.array([0, 1, 0])
loss = -np.sum(true_probs * np.log(probs))
print(f"Loss: {loss}")
# KL divergence: how much P differs from Q
# (skip zero-probability terms to avoid 0 * log(0) = nan)
mask = true_probs > 0
kl = np.sum(true_probs[mask] * np.log(true_probs[mask] / probs[mask]))
print(f"KL divergence: {kl}") # equals cross-entropy here: a one-hot P has zero entropy
Key Quantities
- Entropy: H(p) = -Σ p(x) log p(x) — uncertainty in distribution
- Cross-entropy: -Σ true(x) log pred(x) — loss function (matches true labels)
- KL divergence: D(P||Q) = Σ P(x) log(P/Q) — divergence between distributions (asymmetric, not a true distance)
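Temperature scaling, mentioned in section 01, follows directly from softmax: divide the logits by a temperature T before normalizing. A sketch with illustrative logits, showing that higher T flattens the distribution (higher entropy) while leaving the argmax unchanged:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution; T > 1 flattens it.
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([1.0, 2.0, 0.5])
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    entropy = -np.sum(probs * np.log(probs))
    print(f"T={T}: probs={np.round(probs, 3)}, entropy={entropy:.3f}")
```

This is the knob LLM sampling exposes: low temperature makes greedy-ish, repetitive output; high temperature makes diverse, riskier output.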
05 — Evaluation
Statistics for Model Evaluation
Mean, variance, confidence intervals, hypothesis testing. Used to assess model performance reliably and compare models statistically.
Common Statistics
- Mean/variance: Average performance ± spread
- Confidence intervals: Range where true metric likely lies (e.g., 95% CI)
- Bootstrapping: Resample data to estimate confidence without assumptions
- A/B testing: T-test for statistical significance between models
# Confidence interval via bootstrapping
import numpy as np
scores = np.array([0.85, 0.88, 0.82, 0.90, 0.87])
bootstrap_means = []
for _ in range(1000):
    sample = np.random.choice(scores, size=len(scores), replace=True)
    bootstrap_means.append(np.mean(sample))
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
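The A/B testing bullet can be sketched with SciPy's independent-samples t-test. Both score arrays below are hypothetical evaluation runs, not real benchmark numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracy scores for two models
model_a = np.array([0.85, 0.88, 0.82, 0.90, 0.87])
model_b = np.array([0.80, 0.83, 0.79, 0.85, 0.82])

# Two-sided independent-samples t-test on the mean difference
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With only five runs per model the test has little power, so in practice you would also report effect size and run more seeds; a p-value below 0.05 is the conventional (if crude) significance threshold.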
06 — Quick Reference
Key Identities
| Name | Formula | Where Used |
| --- | --- | --- |
| Dot product | a · b = Σ a_i · b_i | Attention similarity |
| Softmax | softmax(x)_i = e^(x_i) / Σ_j e^(x_j) | Probability normalization |
| Cross-entropy | -Σ y_i log(ŷ_i) | Classification loss |
| KL divergence | Σ P(x) log(P(x)/Q(x)) | Distribution distance |
| Chain rule | ∂y/∂x = (∂y/∂u) · (∂u/∂x) | Backpropagation |