The math, ML theory, and architecture that power transformers and LLMs. Everything practitioners need to understand how models work.
All of GenAI rests on three layers of theory. Understanding them helps you make better decisions about model selection, fine-tuning, and system design. You don't need a PhD, but you need to know enough to reason about tradeoffs.
Linear algebra, calculus, and probability form the bedrock. How loss functions guide optimization, how gradients flow backward through networks, how probability constrains outputs. You don't need to derive these from first principles, but you need to understand: gradient descent, backpropagation, softmax, and cross-entropy loss. These appear in every model, every framework, every debugging session.
Supervised learning, loss functions, regularization, and evaluation. How we train models to predict targets from inputs, how we prevent overfitting, how we measure what actually matters. This is where concepts like train/test splits, validation, and metrics come from.
Transformers, attention mechanisms, residual connections, and scaling laws. Why transformers work, how attention computes relevance, why bigger models are often better, and what tradeoffs exist between model size, latency, and cost.
The mathematical principles that let neural networks learn from data. You need to understand the conceptual picture, not the detailed derivations.
How models learn. Start with random weights. Compute how wrong the prediction is (loss). Ask: if I nudge each weight up or down, does loss go up or down? Follow the direction that reduces loss. Repeat. This is gradient descent: walking downhill on a loss landscape to find good weights.
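The loop above fits in a few lines of Python. Here the quadratic loss f(w) = (w - 3)^2 is a toy stand-in for a real network's loss surface; the learning rate and step count are illustrative.

```python
# Toy gradient descent: minimize f(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # analytic derivative of the loss

w = 0.0    # "random" starting weight
lr = 0.1   # learning rate: how big a step to take downhill
for _ in range(100):
    w -= lr * grad(w)  # step in the direction that reduces loss

print(round(w, 4), round(loss(w), 8))  # w walks downhill to ~3.0, loss to ~0
```

The same three ingredients (a loss, its gradient, a step size) drive the training of every neural network; only the loss surface gets higher-dimensional.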
How to efficiently compute which weights to adjust. Instead of testing each weight independently (too slow), backpropagation computes the gradient of loss with respect to every weight by walking backward through the network. Chain rule: the gradient at each layer depends on the gradient of the layer above it.
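The chain rule can be worked by hand on a two-weight "network" (no nonlinearity, invented numbers) to see how each layer's gradient reuses the gradient of the layer above it:

```python
# Forward pass: x -> h = w1*x -> y = w2*h, loss = (y - t)^2
x, t = 2.0, 10.0
w1, w2 = 1.5, 0.5

h = w1 * x            # 3.0
y = w2 * h            # 1.5
loss = (y - t) ** 2   # 72.25

# Backward pass: chain rule, starting from the loss and walking down
dloss_dy = 2 * (y - t)   # -17.0
dy_dw2 = h               # each weight's gradient uses its own input...
dy_dh = w2               # ...and the gradient flowing down from above
dh_dw1 = x

dloss_dw2 = dloss_dy * dy_dw2          # -17.0 * 3.0 = -51.0
dloss_dw1 = dloss_dy * dy_dh * dh_dw1  # -17.0 * 0.5 * 2.0 = -17.0

print(dloss_dw2, dloss_dw1)  # -51.0 -17.0
```

Note that `dloss_dy` is computed once and shared by both weight gradients; that reuse, applied layer by layer, is exactly what makes backpropagation efficient.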
How to turn raw logits into probabilities and measure wrongness. Softmax converts any vector of numbers into a probability distribution (sums to 1, all non-negative). Cross-entropy measures how far the predicted distribution is from the true one. Together they're the standard recipe for classification: predict softmax probabilities, penalize distance from the truth via cross-entropy loss.
How transformers decide which parts of the input are relevant. Compute a relevance score between every pair of input tokens (query-key dot product). Soften these scores (softmax). Use them to weight the values. Result: each token attends to other tokens proportional to relevance. This is differentiable, so it learns which tokens matter.
The abstractions and principles that govern how models generalize from training data to real data.
You have limited data. Train on some, test on held-out data you never trained on. If your model performs well on training data but poorly on test data, it's overfitted: it memorized training examples instead of learning general patterns.
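A held-out split is just index bookkeeping. A minimal NumPy sketch (the 80/20 ratio and data sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # targets from a known linear rule

# Hold out 20% for testing: shuffle the indices, then slice.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

The one rule that matters: the model must never see `X_test` during training, or the test score stops measuring generalization.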
The fundamental problem: models can memorize noise. Solutions: (1) early stopping (stop training before overfitting sets in), (2) regularization (penalize large weights, discouraging memorization), (3) data augmentation (feed varied examples), (4) ensembling (combine many models). Fine-tuning is especially prone to overfitting because you have far fewer examples than pre-training used.
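Regularization in its simplest form, an L2 penalty (weight decay), just adds a pull-toward-zero term to every gradient step. A sketch with made-up gradient values:

```python
import numpy as np

# Total loss = data_loss + lam * ||w||^2, so the gradient gains a 2*lam*w
# term that constantly shrinks the weights toward zero.
w = np.array([2.0, -3.0, 0.5])
data_grad = np.array([0.1, -0.2, 0.05])  # pretend gradient from the data loss
lam, lr = 0.01, 0.1

w_plain = w - lr * data_grad               # ordinary gradient step
w_decay = w - lr * (data_grad + 2 * lam * w)  # step with L2 regularization

print(np.abs(w_decay) < np.abs(w_plain))   # decay shrinks every weight
```

Keeping weights small limits how sharply the model can fit any single training example, which is why it discourages memorization.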
Accuracy isn't always the right metric. Precision and recall matter when false positives and false negatives have different costs; F1 balances them. BLEU and ROUGE measure text-generation quality. Perplexity measures how surprised the model is by held-out text. Pick the metric that matches your actual goal.
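Precision, recall, and F1 come straight from raw counts. The labels here are invented for illustration:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 3 real positives
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])  # model flags 3 items

tp = np.sum((y_pred == 1) & (y_true == 1))   # 2 true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # 1 false positive
fn = np.sum((y_pred == 0) & (y_true == 1))   # 1 false negative

precision = tp / (tp + fp)   # of flagged items, how many were right
recall = tp / (tp + fn)      # of real positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Note that accuracy here is 6/8 = 0.75, which looks fine even though the model misses a third of the real positives; that gap is exactly why the metric choice matters.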
Bigger models perform better, but not linearly. Scaling laws show that loss falls as a power law in data, parameters, and compute: more of each means lower loss, with diminishing returns. Interestingly, certain capabilities only emerge at larger scales. GPT-3 at 175B can do few-shot learning; GPT-2 at 1.5B cannot. This is why there's an arms race for bigger models.
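As an illustration, the parameter-count term of the Kaplan et al. (2020) scaling law is L(N) = (N_c / N)^alpha. The sketch below uses their published fitted constants (N_c ≈ 8.8e13, alpha ≈ 0.076); the model sizes are chosen to echo GPT-2 and GPT-3:

```python
# Power-law scaling of loss with parameter count N (Kaplan et al. 2020 form).
def scaling_loss(N, N_c=8.8e13, alpha=0.076):
    return (N_c / N) ** alpha

for N, name in [(1.5e9, "1.5B"), (175e9, "175B"), (1e12, "1T")]:
    print(f"{name:>5}: relative loss {scaling_loss(N):.3f}")
```

Each 100x jump in parameters buys a smaller loss reduction than the last one: power laws mean steady but diminishing returns, which is why scale races are expensive.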
Specific design choices in transformers and LLMs that make them powerful and trainable.
Encoder-decoder (for translation), encoder-only (BERT), or decoder-only (GPT). Decoder-only is now standard for LLMs. Stack of identical layers: each layer applies multi-head attention (many attention mechanisms in parallel) followed by a feedforward network. Skip connections allow gradients to flow. Layer norm stabilizes training.
Convert text to integers. Models work on tokens, not characters. Subword tokenization (BPE, SentencePiece) balances vocabulary size against compression. Your model's context window is measured in tokens, not bytes: a "128K" context window means 128K tokens, not 128K characters.
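The core of BPE training fits in a few lines: count adjacent symbol pairs and merge the most frequent one into a new vocabulary entry. A toy single step (real tokenizers also track word frequencies, end-of-word markers, and repeat this until the vocabulary budget is spent):

```python
from collections import Counter

# Toy corpus; start every word as a sequence of characters.
words = ["low", "lower", "lowest", "new", "newest"]
tokens = [list(w) for w in words]

# Count every adjacent symbol pair across the corpus.
pairs = Counter()
for t in tokens:
    for a, b in zip(t, t[1:]):
        pairs[(a, b)] += 1

# The most frequent pair becomes one merged subword token.
best = pairs.most_common(1)[0][0]
print(best)  # ('l', 'o'): "lo" joins the vocabulary
```

Iterating this merge step is what produces subwords like "est" or "ing" in real vocabularies: frequent character runs get compressed into single tokens.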
Attention is permutation-invariant: it doesn't care about order. So we inject position information into embeddings. Absolute (sinusoidal, learned embeddings) or relative (biases on attention scores). Without position, a transformer can't distinguish "dog bites man" from "man bites dog."
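The sinusoidal variant from the original transformer paper can be sketched directly from its formula, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(...); the sequence length and model width below are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2): even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dims get sin
    pe[:, 1::2] = np.cos(angles)  # odd dims get cos
    return pe

pe = positional_encoding(16, 32)
print(pe.shape)    # (16, 32): one vector per position, added to token embeddings
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 alternating
```

Each position gets a unique pattern of frequencies, so after the encoding is added to the embeddings, attention can tell "dog bites man" from "man bites dog".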
Bigger is better, but with tradeoffs: larger models require more compute for both training and inference, which means higher latency and higher cost. Inference compute scales with parameter count. GPT-3 (175B) is more capable but slower and more expensive; smaller models (7B, 13B) are faster and cheaper but less capable. The Pareto frontier shifts with inference optimizations (quantization, distillation).
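A back-of-envelope way to reason about the cost side: weight memory alone is parameters times bytes per parameter (this ignores KV cache and activations, which add more on top):

```python
# Rough GPU memory needed just to hold the weights.
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

for n, name in [(7e9, "7B"), (70e9, "70B"), (175e9, "175B")]:
    fp16 = weight_memory_gb(n, 2)    # float16: 2 bytes per parameter
    int4 = weight_memory_gb(n, 0.5)  # 4-bit quantized: half a byte each
    print(f"{name}: {fp16:.0f} GB fp16, {int4:.1f} GB int4")
```

This is why a 7B model fits on one consumer GPU in fp16 (~14 GB) while 175B needs a multi-GPU server, and why quantization shifts the Pareto frontier so much.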
Instead of one dense network in each layer, have many expert networks and learn a routing function that decides which experts to use. Each token uses a sparse subset of experts, reducing compute while maintaining model capacity. Enables scaling to very large models (GPT-4 likely uses MoE).
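A toy top-2 router in NumPy (the shapes and random router weights are illustrative; a real MoE layer also runs the chosen experts and weights their outputs by the softmaxed router scores):

```python
import numpy as np

rng = np.random.default_rng(0)

# A learned linear "router" scores every expert for every token;
# only the top-k experts per token actually run.
n_tokens, d_model, n_experts, top_k = 4, 8, 8, 2
x = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))  # the routing function

logits = x @ router_w                             # (tokens, experts)
chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-2 experts per token

print(chosen.shape)  # (4, 2): each token is routed to just 2 of 8 experts
```

With 8 experts and top-2 routing, each token pays the compute of 2 expert FFNs while the layer holds the parameters of 8: capacity grows without a matching compute bill.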
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        Q, K, V = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, T, d_k)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = (weights @ V).transpose(1, 2).reshape(B, T, C)
        return self.out(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, ff_mult: int = 4):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * ff_mult),
            nn.GELU(),
            nn.Linear(d_model * ff_mult, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # residual + attention
        x = x + self.ff(self.norm2(x))    # residual + FFN
        return x

# Quick check: run one transformer block on a random batch
block = TransformerBlock(d_model=256, n_heads=8)
x = torch.randn(2, 16, 256)  # batch=2, seq=16, d=256
out = block(x)
print(f"Input shape: {x.shape} → Output: {out.shape}")  # same shape
```
Don't try to learn everything at once. Follow this path, practicing with code at each step.
Goal: Understand gradient descent, backprop, and loss. Do: Implement gradient descent from scratch on a toy problem (fit a line to points). Read one chapter on linear algebra (matrix multiplication, transposes). Watch an intuitive video on backpropagation. Time: 5-10 hours.
Goal: Train a model end-to-end, understand overfitting. Do: Use PyTorch or TensorFlow to build a simple classifier. Split data into train/test. Monitor training vs validation loss. Experiment with regularization. Time: 10-15 hours.
Goal: Understand transformer architecture. Do: Read "Attention Is All You Need" (the original transformer paper — it's readable). Understand multi-head attention conceptually. Play with a pre-trained transformer in Hugging Face. Time: 10-15 hours.
Goal: Understand how LLMs work and why scale matters. Do: Fine-tune a small LLM on a custom dataset. Read about scaling laws. Experiment with prompt engineering. Time: 15-20 hours.
Goal: Deep dive into what you need for your work (RAG, agents, multimodal). Do: Build an application. Time: Open-ended.
| Week | Topic | Key Concepts | Time |
|---|---|---|---|
| 1 | Linear Algebra | Vectors, matrices, dot product, eigenvalues | 10h |
| 2 | Probability & Statistics | Distributions, Bayes, MLE, entropy, KL divergence | 10h |
| 3 | Calculus for ML | Gradients, chain rule, partial derivatives, Jacobians | 8h |
| 4 | Classical ML | Regression, classification, cross-validation, bias-variance | 12h |
| 5–6 | Deep Learning | Backprop, CNNs, RNNs, optimizers, regularization | 20h |
| 7–8 | Transformers | Attention, positional encoding, BERT/GPT architecture | 20h |
| 9+ | LLM Engineering | RAG, fine-tuning, RLHF, evaluation, deployment | Ongoing |
You don't need a PhD in mathematics to be effective with LLMs. Here's what you actually need to know.
These five concepts cover 80% of what you need. You can train models, debug failures, and make architectural decisions with just this knowledge. The remaining 20% is domain-specific (NLP, vision, multimodal, etc.).
```python
import numpy as np

# 1. Cosine similarity — used in embedding search
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cos_sim:.4f}")  # 0.9746

# 2. Softmax — converts logits to probability distribution
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Softmax: {probs.round(3)}")  # [0.659 0.242 0.099]

# 3. Scaled dot-product attention (single head)
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # scaling keeps softmax out of its vanishing-gradient regime
    weights = softmax(scores, axis=-1)  # each row sums to 1.0
    return weights @ V

seq_len, d_model = 8, 32
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)
output = attention(Q, K, V)  # shape: (8, 32)
print(f"Attention output shape: {output.shape}")

# 4. Cross-entropy loss — used in language model training
def cross_entropy(logits, target_idx):
    probs = softmax(logits)
    return -np.log(probs[target_idx] + 1e-9)

vocab_size = 50000
logits = np.random.randn(vocab_size)
loss = cross_entropy(logits, target_idx=42)
print(f"Cross-entropy loss: {loss:.4f}")
```
Each foundation topic deserves its own deep study. Start with whichever is weakest for you.
Linear algebra, calculus, and probability. Gradients, optimization, and loss functions that power all neural networks.
Supervised learning, regularization, evaluation, and scaling laws. The principles that govern how models generalize.
Attention mechanisms, positional encoding, and the architecture that powers modern LLMs.
How LLMs work: tokenization, pre-training, instruction fine-tuning, and emergent capabilities at scale.
Vision transformers, CLIP, and extending transformers to images, audio, and video.
One week of focused study to build foundation intuition. Combine reading, videos, and hands-on code.
Read: Chapter 2 of "Neural Networks from Scratch" or watch Andrew Ng's Coursera. Code: Implement gradient descent fitting a parabola. Visualize: Plot the loss landscape and your descent path. Time: 2 hours.
Watch: 3Blue1Brown video on backpropagation (20 min). Code: Build a tiny neural network (3 layers) from scratch, forward pass and backward pass. Time: 2 hours.
Read: PyTorch tutorials on autograd. Code: Rewrite your tiny network in PyTorch, see how autograd computes gradients. Time: 2 hours.
Read: Softmax and cross-entropy. Code: Build a classifier on MNIST (handwritten digits). Train it. Plot training vs validation loss. Time: 3 hours.
Watch: "Attention is All You Need" explained visually. Read: The attention mechanism section of the original paper (skip the math). Code: Implement a toy attention mechanism (5 lines of numpy). Time: 2 hours.
Read: How multi-head attention and feedforward layers combine. Code: Build one transformer block from scratch using PyTorch. Time: 2 hours.
Read: Chinchilla scaling laws, emergent abilities. Code: Fine-tune a small LLM on a custom dataset. Time: 2 hours.