Theory & Architecture

GenAI Foundations

The math, ML theory, and architecture that powers transformers and LLMs. Everything practitioners need to understand how models work.

01 — Overview

The Three Foundation Layers

All of GenAI rests on three layers of theory. Understanding them helps you make better decisions about model selection, fine-tuning, and system design. You don't need a PhD, but you need to know enough to reason about tradeoffs.

Layer 1: Math — How Models Learn

Linear algebra, calculus, and probability form the bedrock. How loss functions guide optimization, how gradients flow backward through networks, how probability constrains outputs. You don't need to derive these from first principles, but you need to understand: gradient descent, backpropagation, softmax, and cross-entropy loss. These appear in every model, every framework, every debugging session.

Layer 2: Machine Learning Core — The Abstractions

Supervised learning, loss functions, regularization, and evaluation. How we train models to predict targets from inputs, how we prevent overfitting, how we measure what actually matters. This is where concepts like train/test splits, validation, and metrics come from.

Layer 3: Architecture — The Decisions

Transformers, attention mechanisms, residual connections, and scaling laws. Why transformers work, how attention computes relevance, why bigger models are often better, and what tradeoffs exist between model size, latency, and cost.

💡 Why this matters: Every problem you'll face in production — slow inference, model failures, hallucinations, overfitting — can be traced to one of these three layers. Understanding the layers helps you diagnose problems and choose solutions.
02 — Foundations

Layer 1: Math

The mathematical principles that let neural networks learn from data. You need to understand the conceptual picture, not the detailed derivations.

Gradient Descent

How models learn. Start with random weights. Compute how wrong the prediction is (loss). Ask: if I nudge each weight up or down, does loss go up or down? Follow the direction that reduces loss. Repeat. This is gradient descent: walking downhill on a loss landscape to find good weights.
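The walk-downhill loop is small enough to write out in full. A minimal sketch in NumPy, fitting a line to noisy points drawn from y = 2x + 1 (the data, learning rate, and step count here are illustrative choices, not prescriptions):

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=100)

w, b = 0.0, 0.0   # start from arbitrary weights
lr = 0.1          # learning rate (step size)

for step in range(500):
    y_hat = w * x + b                  # forward pass: predict
    loss = np.mean((y_hat - y) ** 2)   # how wrong we are (MSE)
    # Gradient of the loss with respect to each weight
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w                   # step downhill
    b -= lr * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")    # close to the true values 2 and 1
```

This downhill walk is exactly what happens, at vastly larger scale, inside every LLM training run.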

Backpropagation

How to efficiently compute which weights to adjust. Instead of testing each weight independently (too slow), backpropagation computes the gradient of loss with respect to every weight by walking backward through the network. Chain rule: the gradient at each layer depends on the gradient of the layer above it.
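The chain rule is easier to trust once you have checked it numerically. A toy sketch with a single hidden unit: a manual backward pass, verified against a finite-difference gradient (the network and values are made up for illustration):

```python
# Tiny network: x -> h = relu(w1*x) -> y_hat = w2*h, loss = 0.5*(y_hat - y)^2
x, y = 1.5, 2.0
w1, w2 = 0.8, -0.5

def forward(w1, w2):
    h = max(0.0, w1 * x)
    y_hat = w2 * h
    return 0.5 * (y_hat - y) ** 2

# Backward pass: apply the chain rule layer by layer, from the loss down
h = max(0.0, w1 * x)
y_hat = w2 * h
dL_dyhat = y_hat - y                   # dL/dy_hat
dL_dw2 = dL_dyhat * h                  # gradient for the output weight
dL_dh = dL_dyhat * w2                  # propagate the error to the hidden layer
dh_dpre = 1.0 if w1 * x > 0 else 0.0   # relu derivative
dL_dw1 = dL_dh * dh_dpre * x           # gradient for the first weight

# Sanity check against a numerical gradient (finite differences)
eps = 1e-6
num_dw1 = (forward(w1 + eps, w2) - forward(w1 - eps, w2)) / (2 * eps)
print(f"analytic dL/dw1 = {dL_dw1:.6f}, numerical = {num_dw1:.6f}")
```

Autograd in PyTorch or TensorFlow performs exactly this backward walk for every weight in the network, automatically.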

Softmax and Cross-Entropy

How to turn raw logits into probabilities and measure wrongness. Softmax converts any vector of numbers into a probability distribution (sums to 1, all non-negative). Cross-entropy measures how far the predicted distribution is from the true one. Together, they're the standard classification setup: predict softmax probabilities, penalize distance from truth via cross-entropy loss.

Attention Mechanism

How transformers decide which parts of the input are relevant. Compute a relevance score between every pair of input tokens (query-key dot product). Normalize the scores with softmax. Use them to take a weighted sum of the values. Result: each token attends to other tokens in proportion to relevance. The whole operation is differentiable, so the model learns which tokens matter.

Essential intuitions: Gradient descent finds good weights by walking downhill. Backpropagation computes gradients efficiently. Softmax+cross-entropy are the standard classification setup. Attention computes weighted relevance. These four ideas appear in every transformer, every LLM, every fine-tuning run.
03 — Theory

Layer 2: Machine Learning Core

The abstractions and principles that govern how models generalize from training data to real data.

Train/Test Split

You have limited data. Train on some, test on held-out data you never trained on. If your model performs well on training data but poorly on test data, it's overfitted: it memorized training examples instead of learning general patterns.
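The split itself is two lines of index arithmetic. A minimal sketch with synthetic data (the sizes, seed, and 80/20 ratio are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))        # 100 examples, 4 features
y = rng.integers(0, 2, size=100)     # binary labels

# Shuffle indices, then hold out 20% that training never sees
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(f"train: {len(X_train)}, test: {len(X_test)}")  # train: 80, test: 20
```

Shuffling before splitting matters: if the data has any ordering (by time, by class), a naive slice gives you a biased test set.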

Overfitting and Regularization

The fundamental problem: models can memorize noise. Solutions: (1) early stopping (stop training before overfitting sets in), (2) regularization (penalize large weights, discouraging memorization), (3) data augmentation (feed varied examples), (4) ensembling (combine many models). Fine-tuning is especially prone to overfitting because you train on far fewer examples than pre-training used.
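Early stopping, item (1), reduces to tracking the best validation loss with a patience counter. A sketch of that logic in isolation (the loss curve below is fabricated to show the classic overfitting shape):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch to roll back to: training halts once validation
    loss has not improved for `patience` consecutive epochs."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

# Validation loss drops, bottoms out, then rises: the overfitting signature
val = [1.0, 0.7, 0.5, 0.45, 0.47, 0.52, 0.60, 0.71]
print(early_stopping(val))  # 3 — the epoch with the best validation loss
```

In practice you checkpoint the model at each improvement and restore the checkpoint from the returned epoch.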

Evaluation Metrics

Accuracy isn't always right. Precision/recall matter when false positives and false negatives have different costs. F1 balances them. BLEU and ROUGE measure text generation quality. Perplexity measures how surprised the model is by held-out text. Pick the metric that matches your actual goal.
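Precision, recall, and F1 come straight from the confusion-matrix counts, and are worth computing by hand once. A from-scratch sketch (the labels are toy data):

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many real?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real positives, how many found?
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
p, r, f = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # all 0.75 here
```

Note that accuracy on the same labels would also look fine, which is exactly why you pick the metric that matches the cost structure of your errors.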

Scaling Laws and Emergent Abilities

Bigger models perform better — but not linearly. Scaling laws show that loss improves predictably with more parameters, data, and compute, following power laws. Interestingly, certain capabilities only emerge at larger scales: GPT-3 at 175B parameters can do few-shot learning; GPT-2 at 1.5B cannot. This is why there's an arms race for bigger models.
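The power-law shape takes only a few lines of arithmetic to see. A sketch using a Chinchilla-style loss form; the constants are roughly those reported by Hoffmann et al., but treat them as illustrative rather than authoritative:

```python
def predicted_loss(n_params, n_tokens,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # Irreducible loss + a parameter-count term + a data-size term,
    # each decaying as a power law
    return E + A / n_params**alpha + B / n_tokens**beta

for n in [1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n, 1e12):.3f}")
```

Each 10× in parameters buys a smaller and smaller absolute improvement, and the E term is a floor no amount of scale removes — both visible directly in the printed numbers.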

⚠️ Pitfall: Don't trust training loss alone. Monitor validation loss and test metrics. A model that achieves 100% training accuracy is almost certainly overfitting. Use proper evaluation, or your model will fail in production.
04 — Transformers

Layer 3: Architecture

Specific design choices in transformers and LLMs that make them powerful and trainable.

Transformer Architecture

Encoder-decoder (for translation), encoder-only (BERT), or decoder-only (GPT). Decoder-only is now standard for LLMs. Stack of identical layers: each layer applies multi-head attention (many attention mechanisms in parallel) followed by a feedforward network. Skip connections allow gradients to flow. Layer norm stabilizes training.

Tokenization

Convert text to integers. Models work on tokens, not characters. Subword tokenization (BPE, SentencePiece) balances vocabulary size and compression. Your model's context window is measured in tokens, not bytes: a "128K context window" means 128K tokens, not 128K characters.
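BPE's core move — repeatedly merge the most frequent adjacent pair into a new token — fits in a few lines. A toy sketch of just that merge loop (real tokenizers add byte-level handling, pre-tokenization, and a learned merge table saved for reuse):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start from individual characters
for _ in range(3):                  # three merge steps
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)                       # frequent fragments like "low" become single tokens
```

This is why common words cost one token while rare words splinter into several — and why token counts, not character counts, determine what fits in a context window.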

Positional Encoding

Attention is permutation-invariant: it doesn't care about order. So we inject position information into embeddings. Absolute (sinusoidal, learned embeddings) or relative (biases on attention scores). Without position, a transformer can't distinguish "dog bites man" from "man bites dog."
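The sinusoidal variant from the original transformer paper is short enough to write out. A NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd use cos,
    with wavelengths forming a geometric progression across dimensions."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angle = pos / 10000 ** (dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=32)
print(pe.shape)            # (16, 32) — one encoding vector per position
print(pe[0, :4].round(3))  # position 0: [0. 1. 0. 1.]
```

These vectors are simply added to the token embeddings, giving attention a way to tell positions apart.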

Scaling Laws and Model Size

Bigger is better, but with tradeoffs. Larger models need more compute for both training and inference, which means longer latency and higher cost. GPT-3 (175B) is more capable but slower and more expensive; smaller models (7B, 13B) are faster and cheaper but less capable. The Pareto frontier shifts with inference optimization (quantization, distillation).

Mixture of Experts (MoE)

Instead of one dense network in each layer, have many expert networks and learn a routing function that decides which experts to use. Each token uses a sparse subset of experts, reducing compute while maintaining model capacity. Enables scaling to very large models (GPT-4 likely uses MoE).
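The routing idea can be sketched in NumPy: a learned router scores the experts per token, and only the top-k experts run. This is a toy dense simulation of the sparse computation (real MoE layers batch tokens per expert for efficiency, and the shapes here are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, experts, k=2):
    """x: (tokens, d); router_w: (d, n_experts); experts: list of (d, d) matrices.
    Each token is processed by only its top-k experts, weighted by router prob."""
    probs = softmax(x @ router_w)                 # (tokens, n_experts)
    topk = np.argsort(-probs, axis=-1)[:, :k]     # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        p = probs[t, topk[t]]
        p = p / p.sum()                           # renormalize the selected probs
        for weight, e_idx in zip(p, topk[t]):
            out[t] += weight * (x[t] @ experts[e_idx])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_layer(x, router_w, experts, k=2)
print(y.shape)  # (4, 16) — each token touched only 2 of the 8 experts
```

Total parameters scale with the number of experts, but per-token compute scales only with k — that decoupling is the whole point of MoE.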

Architecture insight: Transformers are "just" stacked attention and feedforward layers with good scaling properties. The secret sauce is: (1) attention (learns relevance), (2) depth (many layers), (3) scale (huge models), (4) data (massive training sets). No silver bullets.
Python · Minimal transformer block in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        Q, K, V = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, T, d_k)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = (weights @ V).transpose(1, 2).reshape(B, T, C)
        return self.out(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, ff_mult: int = 4):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * ff_mult),
            nn.GELU(),
            nn.Linear(d_model * ff_mult, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual + attention
        x = x + self.ff(self.norm2(x))     # residual + FFN
        return x

# Test: 2-layer mini transformer
block = TransformerBlock(d_model=256, n_heads=8)
x = torch.randn(2, 16, 256)   # batch=2, seq=16, d=256
out = block(x)
print(f"Input shape: {x.shape} → Output: {out.shape}")  # same shape
05 — Study

Recommended Learning Path

Don't try to learn everything at once. Follow this path, practicing with code at each step.

Month 1: Math Intuition

Goal: Understand gradient descent, backprop, and loss. Do: Implement gradient descent from scratch on a toy problem (fit a line to points). Read one chapter on linear algebra (matrix multiplication, transposes). Watch an intuitive video on backpropagation. Time: 5-10 hours.

Month 2: ML Core

Goal: Train a model end-to-end, understand overfitting. Do: Use PyTorch or TensorFlow to build a simple classifier. Split data into train/test. Monitor training vs validation loss. Experiment with regularization. Time: 10-15 hours.

Month 3: Transformers

Goal: Understand transformer architecture. Do: Read "Attention Is All You Need" (the original transformer paper — it's readable). Understand multi-head attention conceptually. Play with a pre-trained transformer in Hugging Face. Time: 10-15 hours.

Month 4: LLMs and Scaling

Goal: Understand how LLMs work and why scale matters. Do: Fine-tune a small LLM on a custom dataset. Read about scaling laws. Experiment with prompt engineering. Time: 15-20 hours.

Month 5+: Applied Topics

Goal: Deep dive into what you need for your work (RAG, agents, multimodal). Do: Build an application. Time: Open-ended.

💡 Learning by doing: Theory without code is useless. Theory with code is powerful. At each stage, build something, break it, fix it, understand why.
Week   Topic                      Key Concepts                                                  Time
1      Linear Algebra             Vectors, matrices, dot product, eigenvalues                   10h
2      Probability & Statistics   Distributions, Bayes, MLE, entropy, KL divergence             10h
3      Calculus for ML            Gradients, chain rule, partial derivatives, Jacobians         8h
4      Classical ML               Regression, classification, cross-validation, bias-variance   12h
5–6    Deep Learning              Backprop, CNNs, RNNs, optimizers, regularization              20h
7–8    Transformers               Attention, positional encoding, BERT/GPT architecture         20h
9+     LLM Engineering            RAG, fine-tuning, RLHF, evaluation, deployment                Ongoing
06 — Essentials

Minimal Viable Math for LLM Work

You don't need a PhD in mathematics to be effective with LLMs. Here's what you actually need to know.

ESSENTIAL MATH CONCEPTS

1. GRADIENT DESCENT
   Loss = (prediction - truth)²
   Gradient = direction of steepest loss increase
   Update: weight -= learning_rate * gradient
   Repeat until loss is small enough

2. BACKPROPAGATION (Chain Rule)
   Error at output layer → error at hidden layers
   dL/dW = (dL/dOutput) * (dOutput/dInput) * (dInput/dW)
   Autograd does this for you (PyTorch, TensorFlow)

3. SOFTMAX (Converting Logits to Probabilities)
   logits = [2.0, 1.0, 0.1]    (raw model outputs)
   probs  = softmax(logits)    (normalize to a distribution)
          = [0.66, 0.24, 0.10] (sums to 1)

4. ATTENTION (Query-Key-Value)
   For each token, compute relevance to other tokens
   relevance = softmax(query @ key.T / sqrt(d))
   output    = relevance @ values
   Result: weighted sum of values, weights learned

5. SCALING & LOSS CURVES
   more params → lower loss (power law)
   bigger model → often better (usually worth it)
   overfitting: train loss drops but val loss rises
   solution: regularization, more data, early stopping

Why This Is Enough

These five concepts cover 80% of what you need. You can train models, debug failures, and make architectural decisions with just this knowledge. The remaining 20% is domain-specific (NLP, vision, multimodal, etc.).

Professional secret: Most ML engineers don't memorize the math. They understand the intuition and consult the formulas when needed. Focus on intuition first, derivations later (if ever).
Python · Core math operations that underpin every LLM (NumPy)
import numpy as np

# 1. Cosine similarity — used in embedding search
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cos_sim:.4f}")   # 0.9746

# 2. Softmax — converts logits to probability distribution
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Softmax: {probs.round(3)}")  # [0.659, 0.242, 0.099]

# 3. Scaled dot-product attention (single head)
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps the softmax from saturating
    weights = softmax(scores)         # each row sums to 1.0
    return weights @ V

seq_len, d_model = 8, 32
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)
output = attention(Q, K, V)  # shape: (8, 32)
print(f"Attention output shape: {output.shape}")

# 4. Cross-entropy loss — used in language model training
def cross_entropy(logits, target_idx):
    probs = softmax(logits)
    return -np.log(probs[target_idx] + 1e-9)

vocab_size = 50000
logits = np.random.randn(vocab_size)
loss = cross_entropy(logits, target_idx=42)
print(f"Cross-entropy loss: {loss:.4f}")
07 — Explore

Deep Dives: Foundation Topics

Each foundation topic deserves its own deep study. Start with whichever is weakest for you.

Foundation Clusters

1

Mathematical Foundations

Linear algebra, calculus, and probability. Gradients, optimization, and loss functions that power all neural networks.

2

ML Core

Supervised learning, regularization, evaluation, and scaling laws. The principles that govern how models generalize.

3

Transformers

Attention mechanisms, positional encoding, and the architecture that powers modern LLMs.

4

Language Models

How LLMs work: tokenization, pre-training, instruction fine-tuning, and emergent capabilities at scale.

5

Multimodal Models

Vision transformers, CLIP, and extending transformers to images, audio, and video.

Study order: Math → ML Core → Transformers → LLMs → Multimodal. Don't skip the early layers or later topics become harder.
08 — Quick Start

Quick-Start Foundation Study Guide

One week of focused study to build foundation intuition. Combine reading, videos, and hands-on code.

Day 1: Gradient Descent

Read: Chapter 2 of "Neural Networks from Scratch" or watch Andrew Ng's Coursera. Code: Implement gradient descent fitting a parabola. Visualize: Plot the loss landscape and your descent path. Time: 2 hours.

Day 2: Backpropagation

Watch: 3Blue1Brown video on backpropagation (20 min). Code: Build a tiny neural network (3 layers) from scratch, forward pass and backward pass. Time: 2 hours.

Day 3: PyTorch Basics

Read: PyTorch tutorials on autograd. Code: Rewrite your tiny network in PyTorch, see how autograd computes gradients. Time: 2 hours.

Day 4: Classification and Loss

Read: Softmax and cross-entropy. Code: Build a classifier on MNIST (handwritten digits). Train it. Plot training vs validation loss. Time: 3 hours.

Day 5: Attention Intuition

Watch: "Attention is All You Need" explained visually. Read: The attention mechanism section of the original paper (skip the math). Code: Implement a toy attention mechanism (5 lines of numpy). Time: 2 hours.

Day 6: Transformer Block

Read: How multi-head attention and feedforward layers combine. Code: Build one transformer block from scratch using PyTorch. Time: 2 hours.

Day 7: Scaling and LLMs

Read: Chinchilla scaling laws, emergent abilities. Code: Fine-tune a small LLM on a custom dataset. Time: 2 hours.

⚠️ Time commitment: This is 15-20 hours of focused work. Don't rush. If you get stuck, that's where deep learning is hiding — understand it thoroughly before moving on.
09 — Further Reading

References

Foundational Papers
Textbooks & Courses
Libraries & Implementation