FOUNDATIONS

GenAI Foundations

Math, Python, and core ML concepts — the prerequisites that make everything else click

The math: linear algebra + calculus
The stack: NumPy → PyTorch
The investment: 1-2 weeks
Contents
  1. Why foundations matter
  2. The math
  3. Python scientific stack
  4. PyTorch basics
  5. Learning path
  6. Child pages
  7. References
01 — Importance

Why Foundations Matter

Strong fundamentals in math, Python, and ML theory accelerate learning of everything else. Trying to understand transformers without linear algebra is like trying to understand music theory without hearing intervals. The concepts click when you see them grounded in math and working code.

This section covers the essential bedrock: linear algebra (matrices, vectors), calculus (derivatives, chain rule), probability (distributions, independence), Python (NumPy, Pandas), and PyTorch (tensor operations, autograd).

💡 Learning strategy: Learn by doing. Theory alone won't stick. Write code to implement concepts — matrix multiply, gradient computation, layer forward passes. Formulas become intuition through implementation.
Topic | Why It Matters for LLMs | Key Concept | Resource
Linear algebra | Every layer is a matrix multiply + activation | Matrix multiply, dot product, norms | 3Blue1Brown Essence of LA
Calculus | Backpropagation is the chain rule applied repeatedly | Partial derivatives, chain rule | Karpathy micrograd
Probability | LLMs output token probability distributions | Softmax, cross-entropy, KL divergence | Blitzstein & Hwang
Optimization | Training = gradient descent toward lower loss | SGD, Adam, learning rate schedules | fast.ai Practical DL
NumPy / PyTorch | Implement every concept as runnable code | Tensor ops, autograd, nn.Module | PyTorch tutorials
02 — Math Essentials

The Math: Linear Algebra, Calculus, Probability

Linear Algebra Essentials

Vectors: 1D arrays. Operations: dot product, norm, distance.
Matrices: 2D arrays. Operations: multiply, transpose, inverse, determinant.
Eigenvalues & eigenvectors: for understanding covariance and PCA.
Vector spaces & subspaces: the foundation for dimensionality reduction.

Calculus Essentials

Derivatives: rate of change. Used to find loss minima (gradient descent).
Partial derivatives: derivative w.r.t. one variable. Essential for multivariable optimization.
Chain rule: computes derivatives of composite functions. The core of backpropagation.
Gradient: the vector of partial derivatives. Points in the direction of steepest ascent.
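To make derivatives concrete, here is a small sketch that checks an analytic derivative against a central finite difference; the function f(x) = x² + 3x + 1 is just an illustrative choice:

```python
import numpy as np

def f(x):
    return x**2 + 3*x + 1

def f_prime(x):
    return 2*x + 3  # analytic derivative

def finite_diff(func, x, h=1e-5):
    # central difference: numerical approximation of the derivative
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
print(f"analytic:  {f_prime(x)}")        # 7.0
print(f"numerical: {finite_diff(f, x)}") # ≈ 7.0
```

This "gradient check" trick also works in higher dimensions and is a standard way to validate hand-written backprop code.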

Probability Essentials

Distributions: normal (Gaussian), uniform, categorical.
Expectation & variance: mean and spread of random variables.
Independence: two events/variables are independent if P(A,B) = P(A)P(B).
Bayes' rule: P(A|B) = P(B|A)P(A) / P(B). Fundamental to Bayesian inference.
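Bayes' rule is easiest to internalize with numbers. A classic worked example (the figures below are invented for illustration): a test with 99% sensitivity and a 5% false-positive rate, for a condition with 1% prevalence.

```python
p_disease = 0.01               # prior P(A)
p_pos_given_disease = 0.99     # sensitivity P(B|A)
p_pos_given_healthy = 0.05     # false-positive rate P(B|not A)

# law of total probability: P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ≈ 0.167
```

Despite the accurate test, a positive result implies only ~17% probability of disease, because the prior is so low.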

You don't need every proof: Understand intuition and derivation, not proofs from first principles. Know what matrix multiply does; you don't need to prove it's associative.
Python · NumPy fundamentals: vectors, matrices, and broadcasting
import numpy as np

# Vectors and basic operations
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

print(f"Dot product: {np.dot(v1, v2)}")         # 32.0
print(f"L2 norm: {np.linalg.norm(v1):.4f}")     # 3.7417
print(f"Cosine sim: {np.dot(v1,v2) / (np.linalg.norm(v1)*np.linalg.norm(v2)):.4f}")  # 0.9746

# Matrix operations (the core of neural networks)
A = np.random.randn(4, 8)   # 4 rows, 8 columns
B = np.random.randn(8, 16)  # 8 rows, 16 columns
C = A @ B                   # matrix multiply → shape (4, 16)
print(f"A @ B shape: {C.shape}")

# Broadcasting — apply same operation to each row
batch = np.random.randn(32, 512)   # 32 tokens, 512-dim embeddings
scale = np.random.randn(512)       # per-dimension scale
scaled = batch * scale             # broadcasts: (32, 512) * (512,) → (32, 512)

# Eigenvalues — used in PCA, understanding attention patterns
M = np.random.randn(4, 4)
M = M @ M.T                        # make symmetric positive definite
eigenvalues, eigenvectors = np.linalg.eigh(M)
print(f"Eigenvalues: {eigenvalues.round(2)}")
03 — Python Ecosystem

Python Scientific Stack: NumPy, Pandas, Matplotlib

NumPy: Arrays and Linear Algebra

NumPy is the foundation of the Python data science stack. It provides multidimensional arrays (ndarrays) and linear algebra operations. Virtually all matrices and vectors in GenAI work are NumPy arrays or PyTorch tensors (which mirror the NumPy API).

Pandas: DataFrames and Tabular Data

Pandas provides DataFrames: labeled 2D tables (like SQL or Excel). Essential for EDA (exploratory data analysis), loading datasets, handling missing values, groupby aggregations.
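A minimal sketch of the Pandas operations mentioned above; the toy latency dataset is invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy dataset with a missing value
df = pd.DataFrame({
    "model": ["gpt", "gpt", "llama", "llama"],
    "latency_ms": [120.0, np.nan, 95.0, 101.0],
})

# Handle missing values: fill NaN with the column mean
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].mean())

# Groupby aggregation: mean latency per model
summary = df.groupby("model")["latency_ms"].mean()
print(summary)
```

The same three steps (load, clean, aggregate) cover most exploratory work on evaluation logs and datasets.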

Matplotlib & Seaborn: Visualization

Matplotlib is the low-level plotting library. Seaborn is a higher-level wrapper for statistical plots. Essential for understanding data distributions, loss curves, attention patterns.
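As a sketch of the loss-curve use case, here is a minimal Matplotlib example plotting a synthetic (made-up) training loss; the Agg backend is used so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display needed
import matplotlib.pyplot as plt
import numpy as np

steps = np.arange(200)
# Synthetic decaying loss with a little noise, purely for illustration
loss = 2.5 * np.exp(-steps / 50) + 0.05 * np.random.rand(200)

fig, ax = plt.subplots()
ax.plot(steps, loss, label="train loss")
ax.set_xlabel("step")
ax.set_ylabel("cross-entropy loss")
ax.legend()
fig.savefig("loss_curve.png")
```

Seaborn layers on top of exactly this kind of figure, so the Matplotlib mental model (figure → axes → artists) carries over.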

💡 Jupyter notebooks: For exploration and learning, use Jupyter. Interactive, visual, great for iterating. For production code, use .py files and version control.
Python · Probability essentials used in language models
import numpy as np

# 1. Probability distributions
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])
probs = softmax(logits)
print(f"Softmax: {probs.round(3)}")  # [0.836 0.113 0.051]

# 2. Sampling with temperature
def sample_with_temperature(logits, temperature=1.0):
    """Higher temp = more random; lower temp = more greedy."""
    scaled = logits / temperature
    probs = softmax(scaled)
    return np.random.choice(len(probs), p=probs)

token = sample_with_temperature(logits, temperature=0.7)

# 3. Cross-entropy loss — the training objective for LLMs
def cross_entropy(logits, target_idx):
    probs = softmax(logits)
    return -np.log(probs[target_idx] + 1e-9)

loss = cross_entropy(logits, target_idx=0)   # high confidence → low loss
print(f"Cross-entropy loss: {loss:.4f}")     # 0.1791

# 4. KL Divergence — measures distribution distance (used in RLHF)
def kl_divergence(p, q):
    """KL(P || Q) — how different Q is from the reference P."""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))  # eps avoids log(0)

# Reference model vs fine-tuned model output distributions
p_ref = softmax(np.array([2.0, 1.5, 0.5]))
p_new = softmax(np.array([1.8, 1.6, 0.8]))
kl = kl_divergence(p_ref, p_new)
print(f"KL divergence: {kl:.4f}")   # small = fine-tuned close to reference
06 — Shortcut

Fast Track: What You Actually Need Day 1

You don't need to master all of calculus, linear algebra, and probability before building. A practical minimum for starting LLM engineering work: understand what a matrix multiply is (it's a weighted combination), understand that softmax converts numbers to probabilities, understand that training minimizes a loss function via gradient steps. That's enough to read papers and understand what's happening.

The deeper foundations pay off over time — especially when debugging training instabilities, reading architecture papers, or designing evaluation metrics. But don't block shipping on a math prerequisite. Build first, deepen foundations in parallel.

Python · The 5 NumPy / math operations every LLM engineer needs
import numpy as np

# 1. Matrix multiply — the core of every transformer layer
W = np.random.randn(512, 512)   # weight matrix
x = np.random.randn(512)         # input token embedding
out = W @ x                      # linear projection: (512,)

# 2. Softmax — converts raw scores to a probability distribution
def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()

vocab_logits = np.random.randn(50000)  # one score per token
token_probs = softmax(vocab_logits)    # sums to 1.0
next_token = token_probs.argmax()      # greedy decoding

# 3. Cross-entropy loss — measures how wrong the model's prediction is
def cross_entropy(probs, target_idx):
    return -np.log(probs[target_idx] + 1e-9)  # add eps to avoid log(0)

loss = cross_entropy(token_probs, target_idx=42)

# 4. Cosine similarity — used in embedding search
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = np.random.randn(1536)  # OpenAI text-embedding-3-small dimension
emb2 = np.random.randn(1536)
similarity = cos_sim(emb1, emb2)  # range: -1 to 1

# 5. Layer norm — stabilises training, used in every transformer block
def layer_norm(x, eps=1e-5):
    mean = x.mean()
    std  = x.std()
    return (x - mean) / (std + eps)

normed = layer_norm(out)  # zero mean, unit variance
04 — PyTorch Framework

PyTorch Basics

Tensors: The Core Data Structure

PyTorch tensors are arrays on GPU or CPU with autograd support. They behave like NumPy arrays but can compute gradients automatically. Shapes matter: (batch, seq_len, hidden_dim) is the standard layout for transformer inputs.
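A quick sketch of that standard shape, plus the reshape used to split the hidden dimension into attention heads (the 8 heads × 64 dims split is an arbitrary illustrative choice):

```python
import torch

batch, seq_len, hidden_dim = 2, 16, 512
x = torch.randn(batch, seq_len, hidden_dim)  # standard transformer input

# Split hidden_dim into (heads, head_dim), then move heads before seq_len
heads, head_dim = 8, 64
xh = x.view(batch, seq_len, heads, head_dim).transpose(1, 2)
print(xh.shape)  # torch.Size([2, 8, 16, 64])
```

Shape errors are the most common PyTorch bug; printing `.shape` at each step like this is the fastest way to catch them.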

Autograd: Automatic Differentiation

Set requires_grad=True on the tensors you want gradients for (module parameters have it by default). Perform operations, then call .backward() on the loss. PyTorch records the computation graph and applies the chain rule to compute every gradient automatically. This is the core of all training.

torch.nn: Building Blocks

Modules encapsulate layers: Linear, Conv2d, Embedding, Dropout, etc. Subclass nn.Module to define custom architectures. Register parameters so optimizer can find them.

Optimization

torch.optim provides optimizers: SGD, Adam, AdamW. Create optimizer with torch.optim.Adam(model.parameters(), lr=...). Call zero_grad() before backprop, then step() after.
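These pieces combine into the standard training loop. A minimal sketch, assuming a tiny made-up regression task; the architecture and hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=10, d_hidden=32, d_out=1):
        super().__init__()
        # nn.Sequential registers the layers' parameters automatically
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
model = TinyMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)  # toy target: sum of the features

initial_loss = None
for step in range(200):
    optimizer.zero_grad()                          # clear stale gradients
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                # autograd fills .grad
    optimizer.step()                               # Adam updates parameters
    if initial_loss is None:
        initial_loss = loss.item()

print(f"loss: {initial_loss:.3f} → {loss.item():.3f}")
```

Every LLM training script, however large, is this same zero_grad / forward / backward / step cycle.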

Code Example: Foundations in Practice

import numpy as np
import torch

# 1. Matrix multiplication (linear algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B
print(f'Matrix mult: {C}')

# 2. Gradient computation (calculus/autograd)
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1
y.backward()
print(f'dy/dx at x=2: {x.grad}')  # Should be 7.0

# 3. Softmax + entropy (probability)
logits = torch.tensor([1.0, 2.0, 0.5])
probs = torch.softmax(logits, dim=0)
entropy = -(probs * torch.log(probs)).sum()
print(f'Entropy: {entropy:.4f}')

# 4. Mini neural network forward pass
x = torch.randn(4, 10)   # batch=4, features=10
W = torch.randn(10, 5)   # weight matrix
b = torch.randn(5)        # bias
y = torch.relu(x @ W + b)
print(f'Output shape: {y.shape}')  # [4, 5]
05 — Study Plan

Learning Path (1-2 weeks)

Week 1: Math + Python

Day 1-2: Linear algebra fundamentals. Vectors, matrices, dot product, transpose, matrix multiply. Watch 3Blue1Brown's "Essence of Linear Algebra" on YouTube.
Day 3-4: Calculus. Derivatives, partial derivatives, chain rule. Understand gradients geometrically.
Day 5: Probability. Distributions, independence, Bayes' rule.
Day 6-7: NumPy hands-on. Implement matrix operations, sample from random distributions, basic aggregations.

Week 2: PyTorch + Integration

Day 8-9: PyTorch tensors. Create tensors, reshape, slice, broadcast.
Day 10-11: Autograd. Write a function, compute gradients, understand the computation graph.
Day 12: Build a tiny neural network. Define the forward pass, compute a loss, backprop, update weights manually.
Day 13-14: Small projects. Fit a network to toy data. Implement a simple optimization loop.

⚠️ Reinforce with code: Formulas alone won't stick. Implement concepts: matrix multiply from scratch, gradient of loss w.r.t. weights, softmax by hand. Then use PyTorch and verify results match.
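For example, a hand-rolled softmax checked against PyTorch's implementation — a minimal sketch of the verify-against-the-library workflow:

```python
import numpy as np
import torch

def softmax_manual(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])
mine = softmax_manual(logits)
reference = torch.softmax(torch.from_numpy(logits), dim=0).numpy()

assert np.allclose(mine, reference)
print("manual softmax matches torch.softmax")
```

The same pattern works for cross-entropy, layer norm, and attention: implement it by hand, then assert it agrees with the framework.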
06 — Deep Dives

Child Pages

Explore each foundation in detail:

07 — Further Learning

References

Video Courses
Books & Texts
Official Documentation