ML Foundations

Calculus & Autodiff

Gradients, chain rule, partial derivatives — the foundation of backpropagation and how neural networks learn.

At a glance: ∂L/∂w — the gradient · Chain rule — the key tool · O(1) × forward pass — the cost of autodiff


SECTION 01

Why Calculus?

Training a neural network means finding weights that minimize a loss function. Calculus tells you which direction to move each weight to reduce the loss — that direction is the negative gradient.

Core loop: Forward pass (compute loss) → backward pass (compute gradients) → optimizer step (update weights). Calculus powers the backward pass.
SECTION 02

Derivatives & Gradients

A derivative measures how a function changes. For f(x) = x², f'(x) = 2x — at x = 3 the slope is 6, so a small increase Δx raises f by roughly 6Δx.

# Key derivatives to memorize for ML:
# f(x) = x²     → f'(x) = 2x
# f(x) = eˣ     → f'(x) = eˣ
# f(x) = ln(x)  → f'(x) = 1/x
# f(x) = sigmoid → f'(x) = σ(x)(1 - σ(x))
# f(x) = relu(x) → f'(x) = 1 if x > 0 else 0

# Loss: cross-entropy
# L = -y·log(ŷ) - (1-y)·log(1-ŷ)
# ∂L/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)

# Gradient of MSE loss w.r.t. predictions
import numpy as np

y_pred = np.array([0.9, 0.2, 0.7])
y_true = np.array([1.0, 0.0, 1.0])

mse = ((y_pred - y_true)**2).mean()
grad = 2 * (y_pred - y_true) / len(y_pred)   # ∂MSE/∂ŷ
print(f"MSE: {mse:.4f}, Gradient: {grad}")
SECTION 03

The Chain Rule

The chain rule computes gradients through composed functions. For h(x) = f(g(x)), dh/dx = df/dg · dg/dx. This is how gradients flow backward through layers.

# Chain rule through a 2-layer network:
# L = loss(ŷ) where ŷ = W2 @ h and h = relu(W1 @ x)
#
# ∂L/∂W1 = ∂L/∂ŷ · ∂ŷ/∂h · ∂h/∂W1
#
# Breaking it down:
# ∂L/∂ŷ  = gradient of loss w.r.t. output
# ∂ŷ/∂h  = W2                (linear layer Jacobian)
# ∂h/∂W1 = relu_grad * x     (elementwise relu gradient × input)
#
# Each layer multiplies its local Jacobian into the gradient flowing back.
# Deep networks = many chain rule applications.

The key insight: gradients flow backward through exactly the same path that activations flow forward. Each node in the computation graph has a local gradient, and these multiply together via the chain rule.
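To make the local-gradients idea concrete, here is a minimal sketch (an illustrative example, not from the original text) that checks the chain rule for h(x) = sin(x²) against a finite difference:

```python
import numpy as np

# h(x) = f(g(x)) with g(x) = x² and f(g) = sin(g).
# Local gradients: dg/dx = 2x, df/dg = cos(g).
x = 1.5
g = x**2

# Chain rule: dh/dx = df/dg · dg/dx
dh_dx = np.cos(g) * 2 * x

# Finite-difference check
eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)
print(f"chain rule: {dh_dx:.6f}, numeric: {numeric:.6f}")
```

The two numbers agree to several decimal places, which is exactly the check backprop implementations are validated with.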

SECTION 04

Backprop from Scratch

Backpropagation is just the chain rule applied systematically to a computation graph.

import numpy as np

# Simple 1-hidden-layer network: loss = MSE(W2 @ relu(W1 @ x), y)
np.random.seed(42)
x = np.random.randn(4)              # input
y = np.array([1.0])                 # target
W1 = np.random.randn(3, 4) * 0.1
W2 = np.random.randn(1, 3) * 0.1

# Forward pass
h1 = W1 @ x                         # (3,)
a1 = np.maximum(0, h1)              # relu
y_pred = W2 @ a1                    # (1,)
loss = ((y_pred - y)**2).mean()

# Backward pass (chain rule)
dL_dy = 2 * (y_pred - y) / len(y)       # ∂L/∂ŷ
dL_dW2 = dL_dy[:, None] @ a1[None, :]   # (1, 3)
dL_da1 = W2.T @ dL_dy                   # (3,) — back through W2
dL_dh1 = dL_da1 * (h1 > 0)              # relu backward: 0 where h1 <= 0
dL_dW1 = dL_dh1[:, None] @ x[None, :]   # (3, 4)

print(f"Loss: {loss:.4f}")
print(f"dL/dW1 shape: {dL_dW1.shape}")
SECTION 05

PyTorch Autograd

PyTorch's autograd engine does all the above automatically by tracing operations on tensors with requires_grad=True.

import torch

# Autograd tracks every operation on tensors with requires_grad=True
x = torch.randn(4)
y = torch.tensor([1.0])

# Scale first, then mark as a leaf — multiplying a requires_grad tensor by 0.1
# would create a non-leaf result whose .grad is never populated.
W1 = (torch.randn(3, 4) * 0.1).requires_grad_()
W2 = (torch.randn(1, 3) * 0.1).requires_grad_()

# Forward pass (same as before)
h1 = W1 @ x
a1 = torch.relu(h1)
y_pred = W2 @ a1
loss = ((y_pred - y)**2).mean()

# Backward — compute all gradients
loss.backward()
print(W1.grad)  # Same as our manual dL_dW1!
print(W2.grad)

# With an optimizer
optimizer = torch.optim.Adam([W1, W2], lr=1e-3)
optimizer.zero_grad()                               # Clear old gradients
loss = ((W2 @ torch.relu(W1 @ x) - y)**2).mean()    # Re-run the forward pass
loss.backward()                                     # Compute fresh gradients
optimizer.step()                                    # Update weights
Key rule: Always call optimizer.zero_grad() before loss.backward(). Gradients accumulate by default — a common bug for beginners.
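A quick sketch of the accumulation behavior with an illustrative scalar parameter:

```python
import torch

# Without zeroing, .grad sums across backward calls rather than being replaced.
w = torch.tensor([2.0], requires_grad=True)

loss = (w ** 2).sum()   # dL/dw = 2w = 4
loss.backward()
print(w.grad)           # tensor([4.])

loss = (w ** 2).sum()   # rebuild the graph before the second backward
loss.backward()
print(w.grad)           # tensor([8.]) — accumulated, not replaced

w.grad.zero_()          # what optimizer.zero_grad() does per parameter
print(w.grad)           # tensor([0.])
```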
SECTION 06

Numerical vs Analytical Gradients

You can check your gradient implementation by comparing to finite differences (numerical gradient). Useful for debugging custom layers.

import torch

def numerical_gradient(f, x, eps=1e-5):
    """Finite-difference gradient check."""
    grad = torch.zeros_like(x)
    for i in range(x.numel()):
        x_plus = x.clone()
        x_plus.flatten()[i] += eps
        x_minus = x.clone()
        x_minus.flatten()[i] -= eps
        grad.flatten()[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Check: should match x.grad after backward()
# (use float64 — float32 round-off swamps the finite-difference estimate)
x = torch.randn(5, dtype=torch.float64, requires_grad=True)

def f(t):
    return (t**3 + 2*t).sum()

# Analytical
loss = f(x)
loss.backward()
analytic = x.grad.clone()

# Numerical
numeric = numerical_gradient(f, x.detach())
print(f"Max difference: {(analytic - numeric).abs().max():.2e}")
# Should be ~1e-6 or smaller
When gradients explode or vanish: Check norms with for p in model.parameters(): print(p.grad.norm()). Gradient clipping (torch.nn.utils.clip_grad_norm_) is the standard fix.
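A minimal clipping sketch, assuming a toy nn.Linear model with the loss deliberately scaled up to force a large gradient:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(8, 10), torch.randn(8, 1)

loss = ((model(x) - y) ** 2).mean() * 1e3   # scaled up to inflate the gradient
loss.backward()

# clip_grad_norm_ rescales all gradients so their global norm is <= max_norm;
# it returns the norm as it was before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"norm before clipping: {total_norm:.2f}")

clipped = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"norm after clipping:  {clipped:.2f}")   # at most 1.0
```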

SECTION 07

Multivariate & Vector Calculus

Extensions to multiple variables enable optimization and analysis of complex systems. Multivariate calculus introduces partial derivatives, the gradient vector, and higher-order derivatives that capture how functions change across multiple dimensions. The gradient points in the direction of steepest increase, making it fundamental to optimization algorithms. Understanding multivariate calculus is essential for advanced machine learning, as most real-world problems involve hundreds or millions of variables.

Vector calculus extends these concepts further with operations like divergence, curl, and line integrals. While full vector calculus may seem abstract, its practical importance is immense. Optimization algorithms navigate high-dimensional spaces using these concepts, and understanding the geometry of these spaces improves intuition about how learning algorithms behave.

import numpy as np
from scipy.optimize import minimize

# Multivariate function (Rosenbrock)
def f(x):
    return 100*(x[1]-x[0]**2)**2 + (1-x[0])**2

# Gradient (partial derivatives)
def grad_f(x):
    dx = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
    dy = 200*(x[1]-x[0]**2)
    return np.array([dx, dy])

# Optimization using gradient
x0 = np.array([0.0, 0.0])
result = minimize(f, x0, jac=grad_f, method='BFGS')
print(f"Optimal point: {result.x}")
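For contrast with BFGS above, here is plain gradient descent on the same Rosenbrock function (the step size and iteration count are illustrative choices; Rosenbrock's curved, ill-conditioned valley makes first-order descent slow):

```python
import numpy as np

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad_f(x):
    dx = -400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0])
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

x = np.array([0.0, 0.0])
lr = 1e-3                       # small step size keeps the updates stable
for step in range(20000):
    x = x - lr * grad_f(x)      # move against the gradient

print(f"after 20000 steps: {x}")   # crawls toward the optimum at (1, 1)
```

BFGS reaches the optimum in far fewer iterations because it approximates curvature; plain gradient descent only sees the first-order slope.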

SECTION 08

Calculus in ML & AI Systems

Calculus provides the mathematical foundation for gradient descent and neural network training. Every parameter update during model training relies on calculating partial derivatives via backpropagation. The chain rule enables computing gradients through multiple layers, making deep learning possible. Understanding these calculus concepts deeply helps practitioners debug training issues, design better optimization algorithms, and improve model convergence.

Beyond standard supervised learning, calculus underpins advanced techniques like variational inference, adversarial training, and reinforcement learning: policy-gradient methods differentiate expected return with respect to policy parameters, and variational autoencoders rely on the calculus-based reparameterization trick. As machine learning evolves, deeper calculus knowledge enables innovation in algorithm design.

Concept      Application in ML            Importance
-------      -----------------            ----------
Gradient     Parameter updates            Critical
Chain rule   Backpropagation              Critical
Hessian      Second-order optimization    High
Jacobian     Sensitivity analysis         Medium
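The Jacobian and Hessian can be computed directly with autograd; a small sketch using torch.autograd.functional on an illustrative toy function f(x) = Σ xᵢ³:

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):                       # scalar-valued: f(x) = sum(x_i^3)
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0])
print(jacobian(f, x))           # [3, 12] — matches the analytic 3x²
print(hessian(f, x))            # diag(6x) = [[6, 0], [0, 12]]
```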

Developing strong mathematical foundations in calculus accelerates progress in machine learning. While modern frameworks automate gradient computation, understanding the underlying calculus enables principled algorithm design, effective debugging, and innovation in pushing the boundaries of what machine learning systems can accomplish.

The intuition behind the chain rule is to visualize how small changes propagate through nested functions. In neural networks, the chain rule computes how input perturbations affect the final output through many layers. Understanding it deeply provides insight into backpropagation and helps diagnose training issues like vanishing or exploding gradients; the rule fundamentally enables deep learning as we know it.

Numerical differentiation versus analytical gradients is an important practical distinction. Numerical gradients computed via finite differences are a reliable but computationally expensive check. Analytical gradients computed via backpropagation are efficient but error-prone if implemented by hand. Careful ML implementations verify analytical gradients against numerical ones during development, especially for custom layers; this gradient-checking practice prevents subtle bugs that cause silent training failures.

Advanced optimization techniques build on calculus foundations. Second-order methods use Hessian matrices to capture curvature information, enabling more sophisticated updates than first-order gradient descent. Natural gradient descent uses information geometry to adapt updates for specific problem structures. Developing and understanding these techniques requires solid calculus foundations and geometric intuition.
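A minimal Newton's-method sketch on an assumed quadratic bowl, where the Hessian is the constant matrix A and a single Newton step lands on the minimizer:

```python
import numpy as np

# f(x) = 0.5 xᵀAx - bᵀx, so ∇f(x) = Ax - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite Hessian
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
x = x - np.linalg.solve(A, grad(x))      # Newton step: x - H⁻¹∇f

print(f"Newton solution: {x}")                              # [0.2, 0.4]
print(f"gradient norm:   {np.linalg.norm(grad(x)):.2e}")    # ~0 after one step
```

For a quadratic, the second-order model is exact, which is why one step suffices; on general functions Newton's method iterates this step.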

Calculus connects deeply to fundamental machine learning concepts including loss surfaces, convergence, and generalization. Understanding how optimization algorithms navigate loss landscapes—finding minima, navigating saddle points, escaping local optima—requires calculus-based geometric reasoning. This deeper understanding enables designing better algorithms and understanding why existing algorithms work.

The computational cost of computing derivatives through backpropagation is within a small constant factor (typically 2–3×) of the forward pass. This remarkable efficiency enables training deep networks with millions of parameters. Understanding this cost equivalence helps practitioners grasp why backpropagation revolutionized deep learning—it made training deep networks computationally feasible. Without backpropagation's efficiency, modern deep learning would be impractical.

Taylor series expansions provide intuition for how optimization algorithms work. Gradient descent can be viewed as following first-order Taylor approximations locally. Newton's method uses second-order approximations. Understanding these connections deepens insight into optimization and explains why different methods have different convergence properties and behavior near local minima and saddle points.
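A small numeric sketch of this idea (the function and expansion point are illustrative): first- and second-order Taylor approximations of eˣ around 0, evaluated a half-step away:

```python
import numpy as np

a, dx = 0.0, 0.5
f = np.exp                                 # convenient: f = f' = f'' for eˣ

first = f(a) + f(a) * dx                   # f(a) + f'(a)·dx
second = first + 0.5 * f(a) * dx**2        # + (1/2) f''(a)·dx²
exact = f(a + dx)

print(f"exact: {exact:.4f}, 1st-order: {first:.4f}, 2nd-order: {second:.4f}")
# Gradient descent trusts the 1st-order view locally; Newton's method the 2nd.
```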

Automatic differentiation systems in frameworks like PyTorch and TensorFlow implement calculus computations efficiently. These systems transform mathematical operations into computation graphs, automatically computing derivatives through the graph. While users rarely implement automatic differentiation themselves, understanding how it works enables effective debugging of gradient-related issues and designing custom operations that compose properly.
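A toy reverse-mode autodiff sketch, deliberately minimal and not how PyTorch or TensorFlow are actually implemented: each node records its parents with the corresponding local gradients, and backward() applies the chain rule through the graph.

```python
class Var:
    """A scalar node in a computation graph."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # local gradients: d(xy)/dx = y, d(xy)/dy = x
        out.parents = ((self, other.value), (other, self.value))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))
        return out

    def backward(self, seed=1.0):
        self.grad += seed                        # accumulate contributions
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)   # chain rule

x, y = Var(2.0), Var(3.0)
z = x * y + x           # z = xy + x → dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)   # 4.0 2.0
```

Real frameworks do the same thing with tensors, a recorded operation tape, and a topological ordering instead of naive recursion.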

Advanced topics building on calculus foundations include optimal transport, generative models, and reinforcement learning. Wasserstein distances in optimal transport use calculus-based formulations. Variational inference relies on calculus-based optimization. These cutting-edge techniques all rest on solid calculus foundations, making continued study of calculus worthwhile for researchers pushing the boundaries of machine learning.