Math, Python, and core ML concepts — the prerequisites that make everything else click
Strong fundamentals in math, Python, and ML theory accelerate learning of everything else. Trying to understand transformers without linear algebra is like trying to understand music theory without hearing intervals. The concepts click when you see them grounded in math and working code.
This section covers the essential bedrock: linear algebra (matrices, vectors), calculus (derivatives, chain rule), probability (distributions, independence), Python (NumPy, Pandas), and PyTorch (tensor operations, autograd).
| Topic | Why It Matters for LLMs | Key Concept | Resource |
|---|---|---|---|
| Linear algebra | Every layer is matrix multiply + activation | Matrix multiply, dot product, norms | 3Blue1Brown Essence of LA |
| Calculus | Backpropagation is chain rule applied repeatedly | Partial derivatives, chain rule | Karpathy micrograd |
| Probability | LLMs output token probability distributions | Softmax, cross-entropy, KL divergence | Blitzstein & Hwang |
| Optimization | Training = gradient descent toward lower loss | SGD, Adam, learning rate schedules | fast.ai Practical DL |
| NumPy / PyTorch | Implement every concept as runnable code | Tensor ops, autograd, nn.Module | PyTorch tutorials |
- **Vectors:** 1D arrays. Operations: dot product, norm, distance.
- **Matrices:** 2D arrays. Operations: multiply, transpose, inverse, determinant.
- **Eigenvalues & eigenvectors:** key to understanding covariance and PCA.
- **Vector spaces & subspaces:** the foundation for dimensionality reduction.
- **Derivatives:** rate of change; used to find loss minima (gradient descent).
- **Partial derivatives:** derivative with respect to one variable; essential for multivariable optimization.
- **Chain rule:** computes derivatives of composite functions; the core of backpropagation.
- **Gradient:** the vector of partial derivatives; points in the direction of steepest ascent.
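These ideas can be made concrete with a few lines of NumPy: a minimal sketch of gradient descent on a one-variable function, plus a numerical check of the chain rule (all values here are illustrative).

```python
import numpy as np

# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
# Analytic derivative: f'(x) = 2 * (x - 3).
def grad_f(x):
    return 2 * (x - 3)

x = 0.0
lr = 0.1
for step in range(50):
    x -= lr * grad_f(x)  # step against the gradient
print(round(x, 4))  # converges toward 3.0

# Chain rule check: g(x) = sin(x^2) has g'(x) = cos(x^2) * 2x.
# Compare the analytic derivative with a central finite difference.
h = 1e-5
x0 = 1.3
numeric = (np.sin((x0 + h) ** 2) - np.sin((x0 - h) ** 2)) / (2 * h)
analytic = np.cos(x0 ** 2) * 2 * x0
print(abs(numeric - analytic))  # tiny: the two agree
```

The same "step against the gradient" loop, scaled up to millions of parameters with gradients computed by autograd, is all that training a neural network does.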
- **Distributions:** normal (Gaussian), uniform, categorical.
- **Expectation & variance:** the mean and spread of a random variable.
- **Independence:** two events/variables are independent if P(A, B) = P(A)P(B).
- **Bayes' rule:** P(A|B) = P(B|A)P(A) / P(B); fundamental to Bayesian inference.
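A small worked example of Bayes' rule and independence, using made-up numbers (the 99% sensitivity / 95% specificity / 1% prevalence figures are purely illustrative):

```python
import numpy as np

# Bayes' rule: a test with 99% sensitivity and 95% specificity,
# for a condition with 1% prevalence.
p_cond = 0.01
p_pos_given_cond = 0.99        # sensitivity
p_pos_given_no_cond = 0.05     # false-positive rate (1 - specificity)

# P(pos) via the law of total probability
p_pos = p_pos_given_cond * p_cond + p_pos_given_no_cond * (1 - p_cond)

# P(condition | pos) = P(pos | condition) * P(condition) / P(pos)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
print(round(p_cond_given_pos, 4))  # ~0.1667: still only ~17% despite a positive test

# Independence: for two fair, independent coin flips,
# P(A=1, B=1) should approach P(A=1) * P(B=1) = 0.25.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 100_000)
b = rng.integers(0, 2, 100_000)
joint = np.mean((a == 1) & (b == 1))
print(round(joint, 3))  # close to 0.25
```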
```python
import numpy as np

# Vectors and basic operations
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])
print(f"Dot product: {np.dot(v1, v2)}")     # 32.0
print(f"L2 norm: {np.linalg.norm(v1):.4f}") # 3.7417
print(f"Cosine sim: {np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)):.4f}")  # 0.9746

# Matrix operations (the core of neural networks)
A = np.random.randn(4, 8)   # 4 rows, 8 columns
B = np.random.randn(8, 16)  # 8 rows, 16 columns
C = A @ B                   # matrix multiply -> shape (4, 16)
print(f"A @ B shape: {C.shape}")

# Broadcasting — apply the same operation to each row
batch = np.random.randn(32, 512)  # 32 tokens, 512-dim embeddings
scale = np.random.randn(512)      # per-dimension scale
scaled = batch * scale            # broadcasts: (32, 512) * (512,) -> (32, 512)

# Eigenvalues — used in PCA and for analysing attention patterns
M = np.random.randn(4, 4)
M = M @ M.T  # make symmetric (positive semi-definite)
eigenvalues, eigenvectors = np.linalg.eigh(M)
print(f"Eigenvalues: {eigenvalues.round(2)}")
```
NumPy is the foundation of Python data science. It provides multidimensional arrays (ndarrays) and linear algebra operations. All matrices and vectors in GenAI work are NumPy arrays or PyTorch tensors (which follow a NumPy-like API and convert to and from NumPy arrays).
Pandas provides DataFrames: labeled 2D tables (like SQL or Excel). Essential for EDA (exploratory data analysis), loading datasets, handling missing values, groupby aggregations.
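A minimal sketch of that Pandas workflow, using a tiny made-up dataset (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataset with a missing value
df = pd.DataFrame({
    "model": ["gpt", "gpt", "llama", "llama", "llama"],
    "latency_ms": [120.0, 95.0, np.nan, 210.0, 180.0],
    "tokens": [50, 40, 60, 70, 65],
})

# Handle missing values (median imputation), then groupby aggregation
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())
summary = df.groupby("model").agg(
    mean_latency=("latency_ms", "mean"),
    total_tokens=("tokens", "sum"),
)
print(summary)
```

The same few operations (load, inspect, impute, group, aggregate) cover most of the EDA you need when preparing fine-tuning or evaluation datasets.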
Matplotlib is the low-level plotting library. Seaborn is a higher-level wrapper for statistical plots. Essential for understanding data distributions, loss curves, attention patterns.
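As a sketch of the most common use case, plotting a training loss curve, here is a minimal Matplotlib example (the loss values are synthetic, generated to look like a typical decaying curve):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Synthetic loss curve: exponential decay plus noise (illustrative only)
steps = np.arange(200)
loss = 2.5 * np.exp(-steps / 50) + 0.1 + 0.05 * np.random.randn(200)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(steps, loss, label="train loss")
ax.set_xlabel("step")
ax.set_ylabel("loss")
ax.legend()
fig.savefig("loss_curve.png", dpi=100)
```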
```python
import numpy as np

# 1. Probability distributions
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])
probs = softmax(logits)
print(f"Softmax: {probs.round(3)}")  # [0.836 0.113 0.051]

# 2. Sampling with temperature
def sample_with_temperature(logits, temperature=1.0):
    """Higher temp = more random; lower temp = more greedy."""
    scaled = logits / temperature
    probs = softmax(scaled)
    return np.random.choice(len(probs), p=probs)

token = sample_with_temperature(logits, temperature=0.7)

# 3. Cross-entropy loss — the training objective for LLMs
def cross_entropy(logits, target_idx):
    probs = softmax(logits)
    return -np.log(probs[target_idx] + 1e-9)

loss = cross_entropy(logits, target_idx=0)  # high confidence -> low loss
print(f"Cross-entropy loss: {loss:.4f}")  # 0.1791

# 4. KL divergence — measures distribution distance (used in RLHF)
def kl_divergence(p, q):
    """KL(P || Q) — how different Q is from the reference P."""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

# Reference model vs fine-tuned model output distributions
p_ref = softmax(np.array([2.0, 1.5, 0.5]))
p_new = softmax(np.array([1.8, 1.6, 0.8]))
kl = kl_divergence(p_ref, p_new)
print(f"KL divergence: {kl:.4f}")  # small = fine-tuned stays close to reference
```
You don't need to master all of calculus, linear algebra, and probability before building. A practical minimum for starting LLM engineering work: understand what a matrix multiply is (it's a weighted combination), understand that softmax converts numbers to probabilities, understand that training minimizes a loss function via gradient steps. That's enough to read papers and understand what's happening.
The deeper foundations pay off over time — especially when debugging training instabilities, reading architecture papers, or designing evaluation metrics. But don't block shipping on a math prerequisite. Build first, deepen foundations in parallel.
```python
import numpy as np

# 1. Matrix multiply — the core of every transformer layer
W = np.random.randn(512, 512)  # weight matrix
x = np.random.randn(512)       # input token embedding
out = W @ x                    # linear projection: (512,)

# 2. Softmax — converts raw scores to a probability distribution
def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()

vocab_logits = np.random.randn(50000)  # one score per token
token_probs = softmax(vocab_logits)    # sums to 1.0
next_token = token_probs.argmax()      # greedy decoding

# 3. Cross-entropy loss — measures how wrong the model's prediction is
def cross_entropy(probs, target_idx):
    return -np.log(probs[target_idx] + 1e-9)  # add eps to avoid log(0)

loss = cross_entropy(token_probs, target_idx=42)

# 4. Cosine similarity — used in embedding search
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = np.random.randn(1536)  # OpenAI text-embedding-3-small dimension
emb2 = np.random.randn(1536)
similarity = cos_sim(emb1, emb2)  # range: -1 to 1

# 5. Layer norm — stabilises training, used in every transformer block
def layer_norm(x, eps=1e-5):
    mean = x.mean()
    std = x.std()
    return (x - mean) / (std + eps)

normed = layer_norm(out)  # approximately zero mean, unit variance
```
PyTorch tensors are arrays on GPU/CPU with autograd support. They're like NumPy arrays but can compute gradients automatically. Shapes matter: (batch, seq_len, hidden_dim) is standard for transformer inputs.
Set requires_grad=True on the leaf tensors you want gradients for (model parameters have it set by default). Perform operations. Call .backward() on the loss. PyTorch traces the computation graph and computes gradients automatically via the chain rule. This is the core of all training.
Modules encapsulate layers: Linear, Conv2d, Embedding, Dropout, etc. Subclass nn.Module to define custom architectures. Register parameters so optimizer can find them.
torch.optim provides optimizers: SGD, Adam, AdamW. Create optimizer with torch.optim.Adam(model.parameters(), lr=...). Call zero_grad() before backprop, then step() after.
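Putting nn.Module and torch.optim together, here is a minimal training-loop sketch: a small MLP fitted to a toy regression target (the architecture and hyperparameters are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn

# A minimal two-layer network as an nn.Module subclass
class TinyMLP(nn.Module):
    def __init__(self, d_in=10, d_hidden=32, d_out=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
model = TinyMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Toy regression: predict the sum of the input features
x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)

first_loss = None
for step in range(200):
    optimizer.zero_grad()        # clear old gradients
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backprop via autograd
    optimizer.step()             # parameter update
    if first_loss is None:
        first_loss = loss.item()
print(f"loss: {first_loss:.3f} -> {loss.item():.3f}")
```

Note the order inside the loop: zero_grad() before backward() (gradients accumulate by default), step() after.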
```python
import numpy as np
import torch

# 1. Matrix multiplication (linear algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B
print(f"Matrix mult:\n{C}")

# 2. Gradient computation (calculus / autograd)
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1
y.backward()
print(f"dy/dx at x=2: {x.grad}")  # 2x + 3 = 7.0

# 3. Softmax + entropy (probability)
logits = torch.tensor([1.0, 2.0, 0.5])
probs = torch.softmax(logits, dim=0)
entropy = -(probs * torch.log(probs)).sum()
print(f"Entropy: {entropy:.4f}")

# 4. Mini neural network forward pass
x = torch.randn(4, 10)  # batch=4, features=10
W = torch.randn(10, 5)  # weight matrix
b = torch.randn(5)      # bias
y = torch.relu(x @ W + b)
print(f"Output shape: {y.shape}")  # torch.Size([4, 5])
```
- **Days 1-2:** Linear algebra fundamentals: vectors, matrices, dot product, transpose, matrix multiply. Watch 3Blue1Brown's "Essence of Linear Algebra" on YouTube.
- **Days 3-4:** Calculus: derivatives, partial derivatives, chain rule. Understand gradients geometrically.
- **Day 5:** Probability: distributions, independence, Bayes' rule.
- **Days 6-7:** NumPy hands-on: implement matrix operations, sample from random distributions, basic aggregations.
- **Days 8-9:** PyTorch tensors: creation, reshaping, slicing, broadcasting.
- **Days 10-11:** Autograd: write a function, compute gradients, understand the computation graph.
- **Day 12:** Build a tiny neural network: define the forward pass, compute loss, backprop, update weights manually.
- **Days 13-14:** Small projects: fit a network to toy data; implement a simple optimization loop.
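The Day 12 exercise (forward pass, loss, hand-derived gradient, manual weight update) can be sketched in plain NumPy before reaching for autograd; the model, data, and hyperparameters here are illustrative:

```python
import numpy as np

# Manual training loop for a one-parameter-vector linear model: pred = x @ w
# Loss: mean squared error. Gradient derived by hand: d(MSE)/dw = 2 X^T (pred - y) / n
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = x @ true_w  # noiseless targets so w should recover true_w

w = np.zeros(3)
lr = 0.1
for step in range(300):
    pred = x @ w          # forward pass
    err = pred - y        # residuals
    grad = 2 * x.T @ err / len(y)  # hand-derived MSE gradient
    w -= lr * grad        # manual weight update

print(w.round(3))  # approaches true_w = [1.5, -2.0, 0.5]
```

Once this loop feels obvious, the PyTorch version is the same thing with `loss.backward()` replacing the hand-derived gradient.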
Explore each foundation in detail: