Perceptrons, backpropagation, activation functions, and batch normalisation — the building blocks of deep learning
A neural network is stacked matrix multiplications plus nonlinearities. Each layer applies y = Wx + b, where W is a learnable weight matrix, x is the input, and b is a learnable bias vector; a nonlinear activation (ReLU, GELU) is then applied.
The weight matrix is stored with shape (out_features, in_features), so the forward pass is y = x @ W.T + b. Composing many such layers with nonlinearities yields deep networks capable of learning complex mappings.
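A minimal NumPy sketch of one such layer, with arbitrary illustration sizes (batch, feature counts, and the ReLU choice are assumptions, not values from the text):

```python
import numpy as np

# One linear layer followed by a nonlinearity: y = x @ W.T + b, then ReLU.
rng = np.random.default_rng(0)
in_features, out_features, batch = 4, 3, 2

W = rng.standard_normal((out_features, in_features))  # (out, in) storage layout
b = np.zeros(out_features)
x = rng.standard_normal((batch, in_features))

y = x @ W.T + b          # affine map
h = np.maximum(0.0, y)   # nonlinear activation (ReLU)

print(y.shape)  # (2, 3): one out_features-sized row per batch element
```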
| Function | Formula | Range | Dying-unit risk | LLM use |
|---|---|---|---|---|
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | No | Rare |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | No | Rare |
| ReLU | max(0, x) | [0, inf) | Yes | Legacy |
| GELU | x * Phi(x) | ~[-0.17, inf) | No | Standard |
| SiLU | x * sigmoid(x) | ~[-0.28, inf) | No | Modern |
| Swish | x * sigmoid(beta*x) | ~[-0.28, inf) for beta=1 | No | Common |
ReLU is fast but suffers from dying units: neurons stuck at zero output receive no gradient and permanently stop learning. GELU, a smooth relative of ReLU, avoids this and is standard in transformers. SiLU/Swish are modern alternatives with slightly better empirical performance on large models.
LLMs use GELU or SiLU in feedforward layers. Both let gradients flow through the near-zero region and produce small negative outputs for moderately negative inputs, so units never die the way ReLU units can.
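To make the ranges in the table concrete, here is a small NumPy sketch. It uses the common tanh approximation of GELU (an assumption; exact GELU uses the Gaussian CDF):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of x * Phi(x), as used in many transformer codebases
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x * sigmoid(x)

xs = np.linspace(-5, 5, 2001)
print(round(float(gelu(xs).min()), 2))  # -0.17: GELU dips slightly below zero
print(round(float(silu(xs).min()), 2))  # -0.28: SiLU's minimum
print(float(relu(xs).min()))            # 0.0: ReLU is hard-zero for x < 0
```

The small negative dip is exactly what lets gradients flow where ReLU would be flat.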
Backpropagation computes gradients via the chain rule. Loss L depends on output y, which depends on hidden layer h, which depends on input x and weights W. Chain rule: dL/dW = dL/dy * dy/dh * dh/dW. PyTorch's autograd automates this; you call backward() and gradients are computed end-to-end.
PyTorch builds a dynamic computation graph during forward pass. Each operation records how to compute its gradient. backward() traverses the graph, computing gradients at each node. This is efficient: gradients are computed only for nodes needed to reach the loss.
Key insight: Backprop is automatic differentiation. No hand-coded gradients needed. This enables rapid experimentation and complex architectures.
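The graph mechanics can be sketched with a toy scalar autograd. This `Scalar` class is a hypothetical illustration, not PyTorch's API; it records parents and local derivatives during the forward pass, then applies the chain rule in reverse:

```python
# Toy reverse-mode autodiff: each Scalar records its parents and the local
# derivative of the op that produced it, mimicking a dynamic computation graph.
class Scalar:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = list(parents)  # (parent_node, local_gradient) pairs

    def __add__(self, other):
        return Scalar(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      [(self, other.value), (other, self.value)])

    def backward(self):
        # Topologically sort the graph, then apply the chain rule once per node.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node._parents:
                parent.grad += node.grad * local

# L = (w*x + b)^2; analytically dL/dw = 2*(w*x + b)*x = 2*7*3 = 42
w, x, b = Scalar(2.0), Scalar(3.0), Scalar(1.0)
y = w * x + b
L = y * y
L.backward()
print(w.grad)  # 42.0, matching the hand-derived chain rule
```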
| Type | Normalizes | When used | LLMs |
|---|---|---|---|
| BatchNorm | Per feature, across batch | CNNs, old RNNs | No |
| LayerNorm | Per sample, across features | Transformers | Yes (standard) |
| RMSNorm | Per sample, L2 norm (no mean) | Modern transformers | Yes (modern) |
| GroupNorm | Per group of features | Variable batch size | Rare |
LLMs use LayerNorm (or RMSNorm) because: (1) sequence length varies, so batch statistics are unstable; (2) normalizing per sample (not per batch) avoids coupling examples together; (3) applied before each attention/FFN sub-layer (pre-norm), it stabilizes training. LLaMA-style architectures apply RMSNorm in this pre-norm position.
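A NumPy sketch of the per-sample difference between the two (learnable gain/bias parameters are omitted; the eps value is a typical default, an assumption here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample across its features: subtract mean, divide by std.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm skips mean subtraction: divide by the root-mean-square only.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x).round(3))  # zero-mean, unit-variance per sample
print(rms_norm(x).round(3))    # unit mean-square per sample, mean not removed
```

Note that neither function looks at any other sample in the batch, which is point (2) above.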
Weights initialized poorly → gradients vanish or explode → training fails. Standard approaches: Xavier/Glorot (uniform/normal with scaling for layer size), He/Kaiming (larger variance for ReLU), scaled init (custom scaling based on model architecture).
At initialization, each layer output has some variance. If variance shrinks layer-by-layer, gradients vanish. If variance grows, gradients explode. Proper initialization keeps variance ~1 across layers, allowing stable gradient flow.
LLMs use careful initialization: weights drawn from a narrow normal distribution with standard deviation on the order of 1/sqrt(N), where N is the layer's fan-in; residual projections often get an additional depth-dependent scaling. This keeps activation variance stable during the forward and backward pass. See GPT/LLaMA codebases for exact formulas.
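A quick NumPy experiment showing the variance-preservation argument for He/Kaiming initialization (width, depth, and batch size are arbitrary illustration values):

```python
import numpy as np

# He/Kaiming initialization: std = sqrt(2 / fan_in). The factor 2 compensates
# for ReLU zeroing out roughly half of each layer's pre-activations.
rng = np.random.default_rng(0)
width, depth, batch = 512, 10, 256

x = rng.standard_normal((batch, width))
for _ in range(depth):
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # He scaling
    x = np.maximum(0.0, x @ W.T)  # affine map + ReLU

# The second moment of the activations stays near 1 across all 10 layers
# instead of shrinking (vanishing) or growing (exploding).
print(float((x**2).mean()))
```

Replacing the `sqrt(2.0 / width)` factor with 1.0 makes the same quantity blow up geometrically with depth.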
If the gradient norm exceeds a threshold, scale the gradients down. This prevents exploding gradients from large loss spikes. Standard practice: clip the global gradient norm to 1.0 throughout training. It is essential for transformers and remains standard even with modern optimizers such as AdamW.
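A sketch of global-norm clipping (the same behavior as PyTorch's `clip_grad_norm_`; the threshold of 1.0 is the standard value mentioned above):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Compute the norm over ALL parameters' gradients taken together,
    # then shrink every gradient by the same factor if it is too large.
    total_norm = np.sqrt(sum(float((g**2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0])]                       # global norm = 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)                                           # 5.0 before clipping
print(round(float(np.linalg.norm(clipped[0])), 6))    # 1.0 after clipping
```

Uniform scaling preserves the gradient's direction; only its magnitude is capped.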
Start with a tiny learning rate, increase it gradually over the first ~10K steps, then decay. Warmup stabilizes early training, when the loss landscape is rough. Common recipe: linear warmup to the peak LR, then cosine decay.
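The warmup-then-cosine recipe as a pure function of the step count (peak LR, step counts, and the zero floor are hypothetical hyperparameters for illustration):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=10_000, total_steps=100_000, min_lr=0.0):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps             # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # smoothly 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at(0))        # 0.0: start tiny
print(lr_at(10_000))   # 0.0003: peak LR at the end of warmup
print(lr_at(100_000))  # 0.0: fully decayed
```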
Sudden loss spike: the gradient norm exploded; reduce the learning rate or lower the gradient-clip threshold. Loss plateaus: the learning rate may be too small or the optimizer stuck at a saddle point; increase the LR or change the optimizer. Slow convergence: batch size too small or learning rate too low.
| Tool | What it offers |
|---|---|
| PyTorch | Dynamic computation graphs, autograd, GPU support. Industry standard for research and production LLMs. |
| JAX | Functional array library with autograd. Good for custom kernels and XLA compilation. Used at DeepMind. |
| Flax | JAX-based neural network library. Explicit parameter handling. Growing in the LLM space. |
| TensorFlow | Static/eager graph modes. Less popular for LLMs; strong for production serving. |
| Keras | High-level API. Good for quick prototyping; less control than raw TensorFlow/PyTorch. |
| torchviz | Visualizes PyTorch computation graphs. Debugging tool for understanding the forward/backward pass. |
| Weights & Biases | Experiment tracking, hyperparameter logging, model versioning. Standard for ML teams. |
| Captum | Model interpretability for PyTorch. Shows which inputs/neurons matter for predictions. |