ML Foundations

Linear Algebra

Vectors, matrices, dot products, and SVD — the mathematical bedrock of embeddings, attention, and every transformer forward pass.

SECTION 01

Why Linear Algebra Matters

Every forward pass through a transformer is a sequence of linear algebra operations. Tokens become vectors, attention is scaled dot products, and feed-forward layers are matrix multiplications with nonlinearities.

Core insight: If you understand matrix multiplication and dot products, you understand 80% of how transformers work mechanically.

Where linear algebra appears in ML:

- Embeddings: tokens, sentences, and documents represented as high-dimensional vectors
- Attention: scaled dot products and the QKᵀ matrix multiplication
- Feed-forward layers: matrix multiplications followed by nonlinearities
- LoRA and compression: low-rank decompositions of weight matrices (SVD)
- Normalization: vector norms in LayerNorm and cosine similarity

SECTION 02

Vectors & Embeddings

A vector is a list of numbers representing a point in high-dimensional space. In GenAI, tokens, sentences, and documents are all represented as vectors (embeddings).

```python
import numpy as np

# Embedding vector: token "cat" → 768-dim vector
cat = np.array([0.2, -0.1, 0.8, ...])  # 768 dims in reality

# Dot product = similarity measure
dog = np.array([0.18, -0.09, 0.75, ...])
similarity = np.dot(cat, dog)  # High → semantically similar

# Cosine similarity (normalized dot product)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(cat, dog))  # ~0.97 (similar)
print(cosine_sim(cat, np.array([...airplane_vector...])))  # ~0.3 (dissimilar)
```

Key vector operations:

- Dot product `np.dot(a, b)`: unnormalized similarity; large when vectors point the same way
- Norm `np.linalg.norm(a)`: vector length (magnitude)
- Cosine similarity: dot product of unit vectors; ignores length, measures direction only

SECTION 03

Matrix Multiplication & Attention

Matrix multiplication is the workhorse of deep learning. For matrices A (m×k) and B (k×n), C = AB is an (m×n) matrix where Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ.

```python
import torch

# Attention mechanism is matrix multiplication
seq_len, d_model, d_k = 512, 768, 64
Q = torch.randn(seq_len, d_k)  # Query: (512, 64)
K = torch.randn(seq_len, d_k)  # Key:   (512, 64)
V = torch.randn(seq_len, d_k)  # Value: (512, 64)

# Attention scores: Q·Kᵀ
scores = Q @ K.T  # (512, 512) — every token attends to every other

# Scale to prevent vanishing gradients
scores = scores / (d_k ** 0.5)

# Softmax to get attention weights
weights = torch.softmax(scores, dim=-1)  # (512, 512), rows sum to 1

# Weighted sum of values
output = weights @ V  # (512, 64) — attended representation
```

Why scaling by √d_k? With large d_k, dot products grow in magnitude, pushing softmax into saturation. Dividing by √d_k keeps variance ≈ 1.
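
A quick numerical check of this claim, sketched in NumPy with random Gaussian vectors (the sample size and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 10_000

# Dot products of random vectors with unit-variance entries
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
raw = (q * k).sum(axis=1)    # unscaled dot products
scaled = raw / np.sqrt(d_k)  # scaled as in attention

print(raw.var())     # close to d_k = 64
print(scaled.var())  # close to 1
```

Each of the d_k terms in the dot product has unit variance, so the sum has variance d_k; dividing by √d_k brings it back to 1, keeping the softmax inputs in a well-behaved range.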

SECTION 04

Eigenvalues & SVD

Singular Value Decomposition (SVD) decomposes any matrix M into U Σ Vᵀ. This underpins dimensionality reduction, LoRA-style low-rank adaptation, and the analysis of learned weight matrices.

```python
import numpy as np

# SVD of a weight matrix
W = np.random.randn(768, 768)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
# U: (768, 768), S: (768,) singular values, Vt: (768, 768)

# Low-rank approximation (like LoRA)
r = 16  # Keep top-16 singular values
W_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
print(f"Original: {W.shape}, Approximation error: {np.linalg.norm(W - W_approx):.4f}")

# PCA via SVD — reduce 768-dim embeddings to 2D for visualization
embeddings = np.random.randn(1000, 768)  # 1000 embeddings
centered = embeddings - embeddings.mean(0)  # center before projecting
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:2].T  # (1000, 2) — plot in 2D!
```
LoRA connection: LoRA works because weight updates during fine-tuning are empirically low-rank. SVD confirms this: most singular values of ΔW are near zero, so a rank-8 approximation captures most of the change.

SECTION 05

NumPy Essentials

NumPy is the linear algebra toolkit for Python ML. Most deep learning code uses PyTorch tensors instead, but the same operations apply.

```python
import numpy as np

# Matrix operations
A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
C = A @ B  # Matrix multiply: (4, 5)
D = A.T    # Transpose: (3, 4)

# Broadcasting — critical for batch operations
batch = np.random.randn(32, 768)  # 32 embeddings
mean = batch.mean(axis=0)         # (768,) — mean over batch
normalized = (batch - mean) / batch.std(axis=0)  # LayerNorm-like

# Useful functions
np.linalg.norm(A)             # Frobenius norm
np.linalg.inv(A @ A.T)        # Matrix inverse (square matrices only)
np.einsum('ik,kj->ij', A, B)  # Einstein summation — same as A @ B, but flexible
```

Broadcasting rules: NumPy automatically expands dims of size 1 to match. (32, 768) - (768,) works because (768,) broadcasts to (32, 768).

SECTION 06

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Shape mismatch in matmul | `RuntimeError: mat1 and mat2 shapes cannot be multiplied` | Print `.shape` before every `@`; remember `A @ B` needs A's columns == B's rows |
| Forgetting to transpose | Wrong result, silently | For dot similarity: (seq, d) @ (d, seq) = (seq, seq); use `.T` or `.transpose(-1, -2)` |
| Not normalizing before cosine sim | Length-biased similarity | Always divide by the norm, or use `F.normalize()` |
| Overflow in attention scores | NaN in softmax | Scale by 1/√d_k before softmax |

Debugging tip: Print tensor shapes at every step when debugging a new architecture. Shape issues compound — catch them early.

Gradient descent and backpropagation geometry

Gradient descent navigates the loss landscape by computing the gradient — a vector pointing in the direction of steepest loss increase — and stepping in the opposite direction. The learning rate determines step size, and choosing it correctly is critical: too large causes oscillation or divergence, too small causes slow convergence. The Adam optimizer adapts the learning rate per parameter using running averages of gradients and squared gradients, effectively normalizing gradient updates by their historical magnitude to achieve more stable convergence than vanilla gradient descent across the diverse parameter scales found in transformer models.
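
As a sketch of this idea, here is a minimal Adam step in NumPy using the standard update equations; the hyperparameters are the usual defaults, and the toy objective f(x) = x² (gradient 2x) is chosen purely for illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: running averages of the gradient and squared
    gradient normalize the step size per parameter."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (mean of grad²)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x² starting from x = 5
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
print(x)  # close to 0
```

Note how the effective step size is roughly lr × sign(gradient) early on, regardless of the raw gradient magnitude: that per-parameter normalization is what makes Adam robust across the very different parameter scales in a transformer.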

Dimensionality and the curse of dimensionality

High-dimensional embedding spaces exhibit counter-intuitive geometric properties that affect similarity search. As dimensionality increases, the ratio of the maximum to the minimum pairwise distance between random points converges to 1, meaning that similarity scores cluster in a narrow range and it becomes harder to distinguish relevant from irrelevant embeddings. This concentration of measure explains why embedding models must be carefully trained to use the embedding space efficiently, and why 2D projections from PCA or UMAP should be read with caution: they can suggest clear clusters that are artifacts of the projection rather than structure present in the original high-dimensional space.
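
The distance-concentration effect is easy to observe directly; a minimal NumPy sketch (the point count and dimensions are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_ratio(dim, n=200):
    """Max/min pairwise Euclidean distance among n random Gaussian points."""
    pts = rng.standard_normal((n, dim))
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(n, k=1)  # upper triangle: skip self-distances
    d = dists[iu]
    return d.max() / d.min()

for dim in (2, 32, 512):
    print(dim, distance_ratio(dim))  # ratio shrinks toward 1 as dim grows
```

In 2D the nearest pair of points can be far closer than the farthest pair; in 512 dimensions nearly all pairs sit at almost the same distance.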

| Concept | Transformer role | Key insight |
| --- | --- | --- |
| Matrix multiplication | Attention computation (QKᵀ) | Parallelizable — GPU-efficient |
| Dot product | Similarity score in attention | Scale by 1/√d_k to stabilize gradients |
| Softmax normalization | Attention weight distribution | Converts scores to probabilities |
| Singular value decomposition | Low-rank approximation (LoRA) | Compresses weight updates efficiently |

Principal component analysis (PCA) applied to LLM activations and attention patterns reveals the internal structure of learned representations. The top principal components of the residual stream across layers capture the most variance in model behavior — often corresponding to interpretable directions like sentiment, topic, or factual content. Mechanistic interpretability research uses PCA and related decompositions to identify "circuits" — subsets of attention heads and MLP neurons that implement specific computations. Understanding these internal structures is increasingly important for model steering and intervention techniques that modify model behavior by adding vectors to the residual stream rather than modifying weights.

Batched matrix multiplication is the core operation that makes transformer inference efficient on modern GPUs. For a batch of B sequences, each of length L, with d_k-dimensional queries and keys, attention multiplies a (B, L, d_k) query tensor by a (B, d_k, L) transposed key tensor to produce (B, L, L) attention scores, a single batched matrix operation that GPUs execute near their peak FLOP throughput. CUDA's cuBLAS library and PyTorch's torch.matmul use highly optimized GEMM (General Matrix Multiply) kernels that come close to theoretical peak throughput for these attention-sized matrix products, making matrix-multiplication efficiency the dominant factor in transformer inference speed.
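
The shape bookkeeping can be sketched in NumPy, whose `@` operator broadcasts batch dimensions the same way `torch.matmul` does (the sizes below are illustrative):

```python
import numpy as np

# Batched attention-score shapes
B, L, d_k = 8, 128, 64
Q = np.random.randn(B, L, d_k)
K = np.random.randn(B, L, d_k)

# (B, L, d_k) @ (B, d_k, L) → (B, L, L): one L×L score matrix per sequence
scores = Q @ K.transpose(0, 2, 1)
print(scores.shape)  # (8, 128, 128)
```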

Low-rank matrix decomposition underlies LoRA, the most widely used parameter-efficient fine-tuning method. LoRA writes the weight update for a d_out × d_in layer as ΔW = BA, where B is (d_out × r) and A is (r × d_in) with rank r ≪ min(d_in, d_out). The product BA approximates the full-rank update with only r(d_in + d_out) parameters instead of d_in × d_out, enabling fine-tuning that updates roughly 0.1–1% of the parameters a full update would touch. The mathematical justification is that useful weight updates during fine-tuning are empirically low-rank: they affect a small subspace of the model's representation, making the low-rank approximation a good fit for task-specific adaptation.

Tensor operations in PyTorch follow broadcasting rules inherited from NumPy that enable element-wise operations between tensors of different shapes without explicit replication. Understanding broadcasting is essential for correctly implementing attention score scaling, positional encoding addition, and mask application. The most common broadcasting mistake in attention implementation is applying masks of shape (batch, heads, seq, seq) to attention scores of incompatible shapes — using tensor.shape print statements and the einops library's rearrange function to explicitly track dimension semantics prevents the silent shape errors that cause attention to compute on wrong token pairs.
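
A minimal sketch of mask broadcasting in NumPy, which follows the same rules as PyTorch (the shapes are illustrative):

```python
import numpy as np

# Attention scores for a small batch
batch, heads, seq = 2, 4, 6
scores = np.random.randn(batch, heads, seq, seq)

# Causal mask of shape (1, 1, seq, seq) broadcasts across batch and heads
mask = np.tril(np.ones((seq, seq), dtype=bool))[None, None, :, :]
masked = np.where(mask, scores, -np.inf)  # (2, 4, 6, 6)

print(masked.shape)        # (2, 4, 6, 6)
print(masked[0, 0, 0, 1])  # -inf: token 0 cannot attend to token 1
```

Because the leading dimensions of the mask are size 1, NumPy stretches them to match (batch, heads) without copying the data; the same (seq, seq) causal pattern is applied to every head in every sequence.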

Rotary Position Encoding (RoPE), used in LLaMA and many other recent models, applies a rotation transformation to the query and key vectors at each attention layer based on absolute token position. The key mathematical insight is that when queries and keys are rotated by angles proportional to their absolute positions, their inner product depends only on the relative position between them, implementing position-relative attention without adding explicit position embeddings to the token representations. This algebraic property makes RoPE compatible with context-extension techniques like position interpolation, which extend the usable context length by rescaling the rotation frequencies.
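
The relative-position property can be verified in two dimensions; a minimal NumPy sketch with an arbitrary base angle θ:

```python
import numpy as np

# 2-D RoPE sketch: rotate q and k by position-dependent angles, and the
# dot product depends only on the relative offset m - n.
def rotate(v, pos, theta=0.1):
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    R = np.array([[c, -s], [s, c]])
    return R @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, 2.0])

# Same relative offset (m - n = 3) at different absolute positions
a = rotate(q, 5) @ rotate(k, 2)
b = rotate(q, 105) @ rotate(k, 102)
print(np.isclose(a, b))  # True — the score depends only on m - n
```

This works because R(m)ᵀR(n) = R(n − m): the absolute rotations cancel, leaving only the relative one. Full RoPE applies this same trick pairwise across the embedding dimensions with a spectrum of frequencies θᵢ.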