Math, Python, and core ML concepts — the prerequisites that make everything else click
Strong fundamentals in math, Python, and ML theory accelerate learning of everything else. Trying to understand transformers without linear algebra is like trying to understand music theory without hearing intervals. The concepts click when you see them grounded in math and working code.
This section covers the essential bedrock: linear algebra (matrices, vectors), calculus (derivatives, chain rule), probability (distributions, independence), Python (NumPy, Pandas), and PyTorch (tensor operations, autograd).
| Topic | Why It Matters for LLMs | Key Concept | Resource |
|---|---|---|---|
| Linear algebra | Every layer is matrix multiply + activation | Matrix multiply, dot product, norms | 3Blue1Brown Essence of LA |
| Calculus | Backpropagation is chain rule applied repeatedly | Partial derivatives, chain rule | Karpathy micrograd |
| Probability | LLMs output token probability distributions | Softmax, cross-entropy, KL divergence | Blitzstein & Hwang |
| Optimization | Training = gradient descent toward lower loss | SGD, Adam, learning rate schedules | fast.ai Practical DL |
| NumPy / PyTorch | Implement every concept as runnable code | Tensor ops, autograd, nn.Module | PyTorch tutorials |
- **Vectors:** 1D arrays. Operations: dot product, norm, distance.
- **Matrices:** 2D arrays. Operations: multiply, transpose, inverse, determinant.
- **Eigenvalues & eigenvectors:** key to understanding covariance and PCA.
- **Vector spaces & subspaces:** the foundation for dimensionality reduction.
- **Derivatives:** rate of change; used to find loss minima (gradient descent).
- **Partial derivatives:** derivative with respect to one variable; essential for multivariable optimization.
- **Chain rule:** computes derivatives of composite functions; the core of backpropagation.
- **Gradient:** the vector of partial derivatives; points in the direction of steepest ascent.
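These ideas can be made concrete with a few lines of NumPy: a minimal sketch of gradient descent on a one-variable function, plus a numerical check of the chain rule (all values here are illustrative).

```python
import numpy as np

# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
# Analytic derivative: f'(x) = 2 * (x - 3).
def grad_f(x):
    return 2 * (x - 3)

x = 0.0
lr = 0.1
for step in range(50):
    x -= lr * grad_f(x)  # step against the gradient
print(round(x, 4))  # converges toward 3.0

# Chain rule check: g(x) = sin(x^2) has g'(x) = cos(x^2) * 2x.
# Compare the analytic derivative with a central finite difference.
h = 1e-5
x0 = 1.3
numeric = (np.sin((x0 + h) ** 2) - np.sin((x0 - h) ** 2)) / (2 * h)
analytic = np.cos(x0 ** 2) * 2 * x0
print(abs(numeric - analytic))  # tiny: the two agree
```

The same "step against the gradient" loop, scaled up to millions of parameters with gradients computed by autograd, is all that training a neural network does.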
- **Distributions:** normal (Gaussian), uniform, categorical.
- **Expectation & variance:** the mean and spread of a random variable.
- **Independence:** two events/variables are independent if P(A, B) = P(A)P(B).
- **Bayes' rule:** P(A|B) = P(B|A)P(A) / P(B); fundamental to Bayesian inference.
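A small worked example of Bayes' rule and independence, using made-up numbers (the 99% sensitivity / 95% specificity / 1% prevalence figures are purely illustrative):

```python
import numpy as np

# Bayes' rule: a test with 99% sensitivity and 95% specificity,
# for a condition with 1% prevalence.
p_cond = 0.01
p_pos_given_cond = 0.99        # sensitivity
p_pos_given_no_cond = 0.05     # false-positive rate (1 - specificity)

# P(pos) via the law of total probability
p_pos = p_pos_given_cond * p_cond + p_pos_given_no_cond * (1 - p_cond)

# P(condition | pos) = P(pos | condition) * P(condition) / P(pos)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
print(round(p_cond_given_pos, 4))  # ~0.1667: still only ~17% despite a positive test

# Independence: for two fair, independent coin flips,
# P(A=1, B=1) should approach P(A=1) * P(B=1) = 0.25.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 100_000)
b = rng.integers(0, 2, 100_000)
joint = np.mean((a == 1) & (b == 1))
print(round(joint, 3))  # close to 0.25
```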
```python
import numpy as np

# Vectors and basic operations
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])
print(f"Dot product: {np.dot(v1, v2)}")     # 32.0
print(f"L2 norm: {np.linalg.norm(v1):.4f}") # 3.7417
print(f"Cosine sim: {np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)):.4f}")  # 0.9746

# Matrix operations (the core of neural networks)
A = np.random.randn(4, 8)   # 4 rows, 8 columns
B = np.random.randn(8, 16)  # 8 rows, 16 columns
C = A @ B                   # matrix multiply -> shape (4, 16)
print(f"A @ B shape: {C.shape}")

# Broadcasting — apply the same operation to each row
batch = np.random.randn(32, 512)  # 32 tokens, 512-dim embeddings
scale = np.random.randn(512)      # per-dimension scale
scaled = batch * scale            # broadcasts: (32, 512) * (512,) -> (32, 512)

# Eigenvalues — used in PCA and for analysing attention patterns
M = np.random.randn(4, 4)
M = M @ M.T  # make symmetric (positive semi-definite)
eigenvalues, eigenvectors = np.linalg.eigh(M)
print(f"Eigenvalues: {eigenvalues.round(2)}")
```
NumPy is the foundation of Python data science. It provides multidimensional arrays (ndarrays) and linear algebra operations. All matrices and vectors in GenAI work are NumPy arrays or PyTorch tensors (which follow a NumPy-like API and convert to and from NumPy arrays).
Pandas provides DataFrames: labeled 2D tables (like SQL or Excel). Essential for EDA (exploratory data analysis), loading datasets, handling missing values, groupby aggregations.
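A minimal sketch of that Pandas workflow, using a tiny made-up dataset (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataset with a missing value
df = pd.DataFrame({
    "model": ["gpt", "gpt", "llama", "llama", "llama"],
    "latency_ms": [120.0, 95.0, np.nan, 210.0, 180.0],
    "tokens": [50, 40, 60, 70, 65],
})

# Handle missing values (median imputation), then groupby aggregation
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())
summary = df.groupby("model").agg(
    mean_latency=("latency_ms", "mean"),
    total_tokens=("tokens", "sum"),
)
print(summary)
```

The same few operations (load, inspect, impute, group, aggregate) cover most of the EDA you need when preparing fine-tuning or evaluation datasets.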
Matplotlib is the low-level plotting library. Seaborn is a higher-level wrapper for statistical plots. Essential for understanding data distributions, loss curves, attention patterns.
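As a sketch of the most common use case, plotting a training loss curve, here is a minimal Matplotlib example (the loss values are synthetic, generated to look like a typical decaying curve):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Synthetic loss curve: exponential decay plus noise (illustrative only)
steps = np.arange(200)
loss = 2.5 * np.exp(-steps / 50) + 0.1 + 0.05 * np.random.randn(200)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(steps, loss, label="train loss")
ax.set_xlabel("step")
ax.set_ylabel("loss")
ax.legend()
fig.savefig("loss_curve.png", dpi=100)
```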
```python
import numpy as np

# 1. Probability distributions
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])
probs = softmax(logits)
print(f"Softmax: {probs.round(3)}")  # [0.836 0.113 0.051]

# 2. Sampling with temperature
def sample_with_temperature(logits, temperature=1.0):
    """Higher temp = more random; lower temp = more greedy."""
    scaled = logits / temperature
    probs = softmax(scaled)
    return np.random.choice(len(probs), p=probs)

token = sample_with_temperature(logits, temperature=0.7)

# 3. Cross-entropy loss — the training objective for LLMs
def cross_entropy(logits, target_idx):
    probs = softmax(logits)
    return -np.log(probs[target_idx] + 1e-9)

loss = cross_entropy(logits, target_idx=0)  # high confidence -> low loss
print(f"Cross-entropy loss: {loss:.4f}")  # 0.1791

# 4. KL divergence — measures distribution distance (used in RLHF)
def kl_divergence(p, q):
    """KL(P || Q) — how different Q is from the reference P."""
    return np.sum(p * np.log((p + 1e-9) / (q + 1e-9)))

# Reference model vs fine-tuned model output distributions
p_ref = softmax(np.array([2.0, 1.5, 0.5]))
p_new = softmax(np.array([1.8, 1.6, 0.8]))
kl = kl_divergence(p_ref, p_new)
print(f"KL divergence: {kl:.4f}")  # small = fine-tuned stays close to reference
```
You don't need to master all of calculus, linear algebra, and probability before building. A practical minimum for starting LLM engineering work: understand what a matrix multiply is (it's a weighted combination), understand that softmax converts numbers to probabilities, understand that training minimizes a loss function via gradient steps. That's enough to read papers and understand what's happening.
The deeper foundations pay off over time — especially when debugging training instabilities, reading architecture papers, or designing evaluation metrics. But don't block shipping on a math prerequisite. Build first, deepen foundations in parallel.
```python
import numpy as np

# 1. Matrix multiply — the core of every transformer layer
W = np.random.randn(512, 512)  # weight matrix
x = np.random.randn(512)       # input token embedding
out = W @ x                    # linear projection: (512,)

# 2. Softmax — converts raw scores to a probability distribution
def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()

vocab_logits = np.random.randn(50000)  # one score per token
token_probs = softmax(vocab_logits)    # sums to 1.0
next_token = token_probs.argmax()      # greedy decoding

# 3. Cross-entropy loss — measures how wrong the model's prediction is
def cross_entropy(probs, target_idx):
    return -np.log(probs[target_idx] + 1e-9)  # add eps to avoid log(0)

loss = cross_entropy(token_probs, target_idx=42)

# 4. Cosine similarity — used in embedding search
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = np.random.randn(1536)  # OpenAI text-embedding-3-small dimension
emb2 = np.random.randn(1536)
similarity = cos_sim(emb1, emb2)  # range: -1 to 1

# 5. Layer norm — stabilises training, used in every transformer block
def layer_norm(x, eps=1e-5):
    mean = x.mean()
    std = x.std()
    return (x - mean) / (std + eps)

normed = layer_norm(out)  # approximately zero mean, unit variance
```
PyTorch tensors are arrays on GPU/CPU with autograd support. They're like NumPy arrays but can compute gradients automatically. Shapes matter: (batch, seq_len, hidden_dim) is standard for transformer inputs.
Set requires_grad=True on the leaf tensors you want gradients for (model parameters have it set by default). Perform operations. Call .backward() on the loss. PyTorch traces the computation graph and computes gradients automatically via the chain rule. This is the core of all training.
Modules encapsulate layers: Linear, Conv2d, Embedding, Dropout, etc. Subclass nn.Module to define custom architectures. Register parameters so optimizer can find them.
torch.optim provides optimizers: SGD, Adam, AdamW. Create optimizer with torch.optim.Adam(model.parameters(), lr=...). Call zero_grad() before backprop, then step() after.
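Putting nn.Module and torch.optim together, here is a minimal training-loop sketch: a small MLP fitted to a toy regression target (the architecture and hyperparameters are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn

# A minimal two-layer network as an nn.Module subclass
class TinyMLP(nn.Module):
    def __init__(self, d_in=10, d_hidden=32, d_out=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
model = TinyMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Toy regression: predict the sum of the input features
x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)

first_loss = None
for step in range(200):
    optimizer.zero_grad()        # clear old gradients
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backprop via autograd
    optimizer.step()             # parameter update
    if first_loss is None:
        first_loss = loss.item()
print(f"loss: {first_loss:.3f} -> {loss.item():.3f}")
```

Note the order inside the loop: zero_grad() before backward() (gradients accumulate by default), step() after.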
```python
import numpy as np
import torch

# 1. Matrix multiplication (linear algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B
print(f"Matrix mult:\n{C}")

# 2. Gradient computation (calculus / autograd)
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1
y.backward()
print(f"dy/dx at x=2: {x.grad}")  # 2x + 3 = 7.0

# 3. Softmax + entropy (probability)
logits = torch.tensor([1.0, 2.0, 0.5])
probs = torch.softmax(logits, dim=0)
entropy = -(probs * torch.log(probs)).sum()
print(f"Entropy: {entropy:.4f}")

# 4. Mini neural network forward pass
x = torch.randn(4, 10)  # batch=4, features=10
W = torch.randn(10, 5)  # weight matrix
b = torch.randn(5)      # bias
y = torch.relu(x @ W + b)
print(f"Output shape: {y.shape}")  # torch.Size([4, 5])
```
- **Days 1-2:** Linear algebra fundamentals: vectors, matrices, dot product, transpose, matrix multiply. Watch 3Blue1Brown's "Essence of Linear Algebra" on YouTube.
- **Days 3-4:** Calculus: derivatives, partial derivatives, chain rule. Understand gradients geometrically.
- **Day 5:** Probability: distributions, independence, Bayes' rule.
- **Days 6-7:** NumPy hands-on: implement matrix operations, sample from random distributions, basic aggregations.
- **Days 8-9:** PyTorch tensors: creation, reshaping, slicing, broadcasting.
- **Days 10-11:** Autograd: write a function, compute gradients, understand the computation graph.
- **Day 12:** Build a tiny neural network: define the forward pass, compute loss, backprop, update weights manually.
- **Days 13-14:** Small projects: fit a network to toy data; implement a simple optimization loop.
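The Day 12 exercise (forward pass, loss, hand-derived gradient, manual weight update) can be sketched in plain NumPy before reaching for autograd; the model, data, and hyperparameters here are illustrative:

```python
import numpy as np

# Manual training loop for a one-parameter-vector linear model: pred = x @ w
# Loss: mean squared error. Gradient derived by hand: d(MSE)/dw = 2 X^T (pred - y) / n
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = x @ true_w  # noiseless targets so w should recover true_w

w = np.zeros(3)
lr = 0.1
for step in range(300):
    pred = x @ w          # forward pass
    err = pred - y        # residuals
    grad = 2 * x.T @ err / len(y)  # hand-derived MSE gradient
    w -= lr * grad        # manual weight update

print(w.round(3))  # approaches true_w = [1.5, -2.0, 0.5]
```

Once this loop feels obvious, the PyTorch version is the same thing with `loss.backward()` replacing the hand-derived gradient.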
Explore each foundation in detail: