Foundations · PyTorch

PyTorch Fundamentals

Tensors, autograd, nn.Module, DataLoaders, and training loops — the PyTorch mental model every ML practitioner needs

5 Core Concepts
8 Sections
Python-first Code Examples
In this guide
  1. Tensors
  2. Autograd
  3. nn.Module
  4. DataLoader
  5. Training Loop
  6. GPU Acceleration
  7. Model Persistence
  8. Tools & Ecosystem
01 — Foundation

Tensors: The Core Data Structure

Tensors are multidimensional arrays — the atomic unit of PyTorch computation. Everything flows through tensors: inputs, weights, gradients, outputs. Understanding tensor operations is the foundation of PyTorch proficiency.

Creation and Basic Properties

Tensors can be created from Python lists, NumPy arrays, or generated directly. The three properties you must understand are shape (dimensions), dtype (data type), and device (CPU or GPU).

```python
import torch

# Create from a list
x = torch.tensor([1.0, 2.0, 3.0])
print(x.shape)   # torch.Size([3])

# Create a 2D tensor (batch_size=32, features=128)
y = torch.randn(32, 128)
print(y.shape)   # torch.Size([32, 128])

# Check properties
print(x.dtype)   # torch.float32
print(y.device)  # cpu

# Convert dtype
z = x.float()  # ensure float32
w = y.long()   # convert to int64

# Move to GPU
if torch.cuda.is_available():
    x_gpu = x.cuda()  # or: x_gpu = x.to('cuda:0')
```

Broadcasting and Reshaping

PyTorch follows NumPy broadcasting semantics: shapes are aligned from the right, and size-1 dimensions are stretched to match. .view() changes a tensor's shape without copying data (it requires contiguous memory); .reshape() returns a view when possible and copies otherwise.

```python
# Broadcasting
a = torch.ones(3, 1, 4)  # shape [3, 1, 4]
b = torch.ones(1, 5, 4)  # shape [1, 5, 4]
c = a + b                # broadcasts to [3, 5, 4]

# View vs reshape
x = torch.arange(6)   # [0, 1, 2, 3, 4, 5]
y = x.view(2, 3)      # reshape without copy
z = x.reshape(3, 2)   # reshape with possible copy

# Contiguous memory matters
z_t = z.t()             # transposed view (not contiguous)
z_c = z_t.contiguous()  # convert to C-contiguous for ops that require it
```
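The right-alignment rule can be checked directly; when trailing dimensions neither match nor equal 1, PyTorch raises an error rather than guessing:

```python
import torch

a = torch.ones(3, 1, 4)
b = torch.ones(5, 4)   # aligned from the right: (3, 1, 4) vs (_, 5, 4)
print((a + b).shape)   # torch.Size([3, 5, 4])

try:
    torch.ones(3, 2) + torch.ones(3)  # trailing dims 2 vs 3 don't match
except RuntimeError:
    print("broadcast error")
```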

Indexing and Selection

Tensors support advanced indexing just like NumPy. Understand the difference between basic slicing, which returns views (references into the same memory), and fancy or boolean indexing, which returns copies.

```python
# Basic indexing
x = torch.randn(10, 5)
first_row = x[0]       # shape [5]
first_col = x[:, 0]    # shape [10]
subset = x[2:5, 1:4]   # shape [3, 3]

# Boolean masking (returns a copy)
mask = x > 0
positive = x[mask]     # fancy indexing

# Row selection by index (fancy indexing, returns a copy)
indices = torch.tensor([0, 2, 4])
gathered = x[indices]  # select rows 0, 2, 4
```
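The snippet above selects whole rows; torch.gather does finer-grained, per-element selection along a dimension. A small sketch of how its index tensor maps onto the input:

```python
import torch

x = torch.arange(12).reshape(3, 4)
# gather along dim=1: out[i, j] = x[i, idx[i, j]]
idx = torch.tensor([[0, 3], [1, 2], [2, 0]])
out = torch.gather(x, dim=1, index=idx)
print(out)  # [[0, 3], [5, 6], [10, 8]]
```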
02 — Automatic Differentiation

Autograd: Computing Gradients

Autograd is PyTorch's automatic differentiation system. It tracks operations on tensors and computes gradients via backpropagation. Every tensor has a requires_grad flag that tells PyTorch to record operations on it for the backward pass.

Gradient Tracking and Backward

When you set requires_grad=True, PyTorch builds a computation graph. Calling .backward() triggers backpropagation, computing gradients for all tensors in the graph. Gradients accumulate in the .grad attribute.

```python
import torch

# Simple scalar backward
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad)  # dy/dx = 2x + 3 = 7.0

# Vector input, scalar output (reduce before backward)
x = torch.randn(5, requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad)  # grad[i] = 2 * x[i]

# Non-scalar output: specify the output gradient explicitly
x = torch.randn(3, requires_grad=True)
y = x * 2
y.backward(torch.ones_like(y))
print(x.grad)  # [2., 2., 2.]
```
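Besides .backward(), which writes into .grad, torch.autograd.grad returns gradients directly without mutating the leaf tensors — handy when you want a gradient value without touching training state. A quick check:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
# Returns the gradient as a value instead of storing it in x.grad
(grad,) = torch.autograd.grad(y, x)
print(grad)  # dy/dx = 3 * x**2 = tensor(12.)
print(x.grad)  # None -- .grad was never populated
```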

Gradient Accumulation and Zeroing

Gradients accumulate by default when you call .backward() multiple times on the same tensor. In training loops, you must zero gradients before each backward pass to prevent accumulation.

```python
# Gradient accumulation: grads add up across backward calls
x = torch.tensor(3.0, requires_grad=True)
for i in range(3):
    y = x ** 2
    y.backward()
print(x.grad)  # 6.0 + 6.0 + 6.0 = 18.0

# Correct pattern: zero the gradient before each backward
x = torch.tensor(3.0, requires_grad=True)
for i in range(3):
    x.grad = None  # or x.grad.zero_() once .grad exists
    y = x ** 2
    y.backward()
print(x.grad)  # 6.0 (correct)
```

no_grad Context and Inference

Use torch.no_grad() context to disable gradient tracking during inference or when you need to modify tensors without affecting gradients. This saves memory and computation.

```python
# Inference: disable gradient tracking
model.eval()
with torch.no_grad():
    logits = model(x_test)
    predictions = logits.argmax(dim=-1)

# Manual parameter updates (a plain SGD step by hand)
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad

# detach() breaks the computation graph
y = x ** 2
z = y.detach() ** 2  # z won't flow gradients back to x
```
03 — Model Architecture

nn.Module: Building Model Layers

nn.Module is the base class for all PyTorch models. It organizes parameters, buffers, and submodules. When you define a custom model, inherit from nn.Module, initialize layers in __init__, and implement forward().

Basic Module Structure

Every module tracks its parameters and submodules automatically. When you call model(x), it invokes the forward() method. Parameters become attributes and are automatically registered for optimization.

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleMLP(128, 256, 10)
print(model)  # shows layer structure

# Access parameters
for name, param in model.named_parameters():
    print(name, param.shape)

# Forward pass
x = torch.randn(32, 128)  # batch_size=32, features=128
output = model(x)         # calls forward()
print(output.shape)       # torch.Size([32, 10])
```

Parameter Management and state_dict

Models store learned parameters as nn.Parameter objects. Use state_dict() to get all parameters and buffers as a dictionary — essential for checkpointing and distributed training.

```python
# Save and load model state
torch.save(model.state_dict(), 'checkpoint.pt')

# Later: load state into a fresh model
model = SimpleMLP(128, 256, 10)
model.load_state_dict(torch.load('checkpoint.pt'))

# Freeze parameters
for param in model.fc1.parameters():
    param.requires_grad = False

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable_params}")
```
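One detail the snippet above glosses over: a checkpoint saved from a GPU run fails to load on a CPU-only machine unless you remap devices with map_location. A minimal sketch, using an in-memory buffer in place of a file:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps storages at load time; 'cpu' makes a
# GPU-saved checkpoint loadable on a CPU-only machine
state = torch.load(buffer, map_location='cpu')
fresh = nn.Linear(4, 2)
fresh.load_state_dict(state)
```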

Custom Layers and Modules

Extend nn.Module to create custom layers. Register parameters with nn.Parameter() and submodules with direct assignment in __init__. PyTorch automatically tracks them.

```python
class GateLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.gate_weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        gate = torch.sigmoid(x @ self.gate_weight.t())
        output = (x @ self.weight.t() + self.bias) * gate
        return output
```
04 — Data Pipeline

DataLoader: Batching and Sampling

DataLoader handles the data pipeline: sampling batches, shuffling, collating samples, and prefetching. It wraps a Dataset (which provides individual samples) and handles multiprocessing for efficient I/O.

Dataset and DataLoader Basics

Implement a custom Dataset by inheriting from torch.utils.data.Dataset. Implement __len__() to return dataset size and __getitem__() to return individual samples. Pass to DataLoader for batching.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]

# Create dataset
dataset = CustomDataset(
    inputs=torch.randn(1000, 128),
    labels=torch.randint(0, 10, (1000,))
)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

# Iterate
for batch_inputs, batch_labels in dataloader:
    print(batch_inputs.shape)  # [32, 128]
    print(batch_labels.shape)  # [32]
    break
```

Collate Functions and Custom Batching

The collate_fn merges samples into a batch. Default stacking works for fixed-size tensors. For variable-length data (e.g., sequences), implement custom collate functions to pad or truncate.

```python
def pad_collate_fn(batch):
    """Pad sequences to the same length in a batch."""
    inputs, labels = zip(*batch)
    # Pad sequences to the max length in the batch
    max_len = max(len(x) for x in inputs)
    padded = torch.zeros(len(inputs), max_len)
    for i, x in enumerate(inputs):
        padded[i, :len(x)] = x
    labels = torch.stack(labels)
    return padded, labels

dataloader = DataLoader(
    dataset,
    batch_size=32,
    collate_fn=pad_collate_fn,
    num_workers=0  # num_workers=0 is easiest when debugging a custom collate
)
```

Optimization: num_workers and pin_memory

num_workers > 0 spawns background processes to load data in parallel while the GPU trains. pin_memory=True pre-allocates pinned CPU memory for faster GPU transfers. Use both for maximum throughput.

```python
# DataLoader tuned for GPU training
train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,            # 4 background workers loading data
    pin_memory=True,          # pinned CPU memory for faster transfers
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2         # prefetch 2 batches per worker
)

# For small datasets or debugging, use num_workers=0
debug_loader = DataLoader(
    small_dataset,
    batch_size=8,
    num_workers=0
)
```
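These knobs are workload-dependent, so measure rather than guess. A rough micro-benchmark (hypothetical sizes) that times one full pass over the loader for different num_workers values:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2000, 128), torch.randint(0, 10, (2000,)))

for workers in (0, 2):
    loader = DataLoader(dataset, batch_size=128, num_workers=workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    print(f"num_workers={workers}: {time.perf_counter() - start:.3f}s")
```

For tensors already in memory, worker-process overhead can dominate; the payoff from extra workers comes when each sample involves real disk or decode I/O.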
05 — Core Training Pattern

The Training Loop: Putting It Together

The training loop is the core pattern: forward pass, compute loss, backward pass, optimizer step. Repeat over batches. Validation happens at epoch boundaries to detect overfitting.

Basic Training Loop

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = SimpleMLP(128, 256, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # enable training mode (dropout, batchnorm)
    epoch_loss = 0.0

    for batch_inputs, batch_labels in train_loader:
        # Move to device
        batch_inputs = batch_inputs.to(device)
        batch_labels = batch_labels.to(device)

        # Forward pass
        outputs = model(batch_inputs)
        loss = criterion(outputs, batch_labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()

        # Optimizer step
        optimizer.step()

        epoch_loss += loss.item()

    # Validation at epoch boundary
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch_inputs, batch_labels in val_loader:
            batch_inputs = batch_inputs.to(device)
            batch_labels = batch_labels.to(device)
            outputs = model(batch_inputs)
            val_loss += criterion(outputs, batch_labels).item()

    print(f"Epoch {epoch+1}: "
          f"train_loss={epoch_loss/len(train_loader):.4f}, "
          f"val_loss={val_loss/len(val_loader):.4f}")
```

Advanced: Gradient Accumulation and Mixed Precision

When memory limits force small per-step batches, accumulate gradients over several forward/backward passes before stepping the optimizer to simulate a larger batch. Mixed precision training uses float16 where it is safe for speed and keeps float32 where needed for numerical stability.

```python
# Gradient accumulation for a larger effective batch size
accumulation_steps = 4
optimizer.zero_grad()
for batch_idx, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Mixed precision with autocast
from torch.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    with autocast(device_type='cuda'):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```

Checkpointing and Early Stopping

Save model checkpoints during training to recover from crashes or resume training. Implement early stopping to stop training when validation performance plateaus, preventing overfitting.

```python
best_val_loss = float('inf')
patience = 3
patience_counter = 0

for epoch in range(num_epochs):
    # ... training code ...

    # Validation
    val_loss = evaluate(model, val_loader, device)

    # Checkpointing
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss
        }, 'best_checkpoint.pt')
    else:
        patience_counter += 1

    # Early stopping
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break

# Load the best checkpoint
checkpoint = torch.load('best_checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
```
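The evaluate helper called above is not defined in this guide; a minimal version consistent with the training loop (classification loss averaged over batches, no gradient tracking) might look like:

```python
import torch
import torch.nn as nn

def evaluate(model, val_loader, device):
    """Average validation loss over all batches, without gradient tracking."""
    model.eval()
    criterion = nn.CrossEntropyLoss()
    total_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            total_loss += criterion(model(inputs), labels).item()
    return total_loss / len(val_loader)
```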
06 — Hardware Utilization

GPU Acceleration: Moving to CUDA

PyTorch runs on the CPU by default; CUDA training is often 10–100× faster for large models. The pattern is simple: move tensors and models to the device, compute, and move results back if needed. Device-agnostic code makes switching easy.

Device Management

Always write device-agnostic code using a device variable. Call .to(device) on models and tensors. Check CUDA availability and manage memory with torch.cuda.empty_cache().

```python
import torch

# Check CUDA
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))

# Device-agnostic code
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Move model and data
model = SimpleMLP(128, 256, 10).to(device)
x = torch.randn(32, 128).to(device)
output = model(x)  # computes on the GPU

# Move tensors between devices
y = output.cpu()    # back to CPU
z = y.to('cuda:1')  # to GPU 1

# Free cached GPU memory
torch.cuda.empty_cache()

# Current memory usage
print(torch.cuda.memory_allocated(device))
print(torch.cuda.memory_reserved(device))
```

Distributed Training: DataParallel and DistributedDataParallel

For multi-GPU training, wrap models with DistributedDataParallel (recommended) or DataParallel. DDP runs one process per GPU and synchronizes gradients across them; DataParallel is simpler but single-process and slower.

```python
# DataParallel: simple but slower synchronization
model = SimpleMLP(128, 256, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

# DistributedDataParallel: proper multi-GPU (recommended for production)
# Run with: torchrun --nproc_per_node=4 train.py
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='nccl')
model = SimpleMLP(128, 256, 10).to(device)
# On a single node, rank == local rank; multi-node setups use LOCAL_RANK
model = DistributedDataParallel(model, device_ids=[dist.get_rank()])

# Use DistributedSampler to split data across processes
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True
)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
```

Memory Management

GPU memory is limited. Monitor usage with nvidia-smi or PyTorch APIs. Use gradient checkpointing to trade computation for memory (recompute activations instead of storing them).

```python
# Gradient checkpointing: trades memory for compute
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(256, 256)
        self.layer2 = nn.Linear(256, 256)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Recompute activations during backward instead of storing them
        return checkpoint(self._inner_forward, x, use_reentrant=False)

    def _inner_forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

# Activation memory drops substantially at the cost of extra
# forward compute during the backward pass
```
07 — Saving and Loading

Model Persistence: Checkpoints and Formats

Save models as .pt files containing state_dict (recommended for research) or full models (less portable). For production, use standardized formats like ONNX or TorchScript for deployment.

state_dict vs Full Model

Always save state_dict, not the full model object. state_dict is portable, versionable, and composable. Full model pickling binds you to your exact code structure.

```python
# GOOD: save only the state_dict
torch.save(model.state_dict(), 'model.pt')

# Load (requires the model class definition)
model = SimpleMLP(128, 256, 10)
model.load_state_dict(torch.load('model.pt'))

# BAD: save the full model (not recommended)
torch.save(model, 'model_full.pt')   # pickles the entire model object
model = torch.load('model_full.pt')  # breaks with code changes

# Checkpoint: save state + optimizer + epoch for resuming
checkpoint = {
    'epoch': 42,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss_value
}
torch.save(checkpoint, 'checkpoint_epoch42.pt')

# Resume from a checkpoint
checkpoint = torch.load('checkpoint_epoch42.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
```

ONNX Export for Inference

Convert PyTorch models to ONNX (Open Neural Network Exchange) for deployment to production systems, mobile, or other frameworks. ONNX decouples the model from PyTorch.

```python
import torch.onnx

# Export to ONNX
model.eval()
dummy_input = torch.randn(1, 128)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=14,
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# Verify with onnxruntime
import onnxruntime as ort

ort_session = ort.InferenceSession('model.onnx')
ort_output = ort_session.run(None, {'input': dummy_input.numpy()})
print(ort_output[0].shape)  # (1, 10)
```

TorchScript for Production

TorchScript compiles PyTorch models to a serialized format that runs without Python. Use torch.jit.script or torch.jit.trace to compile models for deployment.

```python
import torch

# JIT tracing: records the ops executed for a sample input
model = SimpleMLP(128, 256, 10)
dummy_input = torch.randn(1, 128)
traced_model = torch.jit.trace(model, dummy_input)
traced_model.save('model_traced.pt')

# Load and run
loaded = torch.jit.load('model_traced.pt')
output = loaded(dummy_input)

# JIT scripting: compiles Python code to TorchScript (handles control flow)
class ScriptedModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1.0

scripted = torch.jit.script(ScriptedModel())
scripted.save('model_scripted.pt')
```
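The practical difference between the two: tracing bakes in whichever control-flow path the example input took, while scripting preserves the branches. A tiny illustration with a hypothetical module:

```python
import torch

class Flip(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:  # data-dependent branch
            return x
        return -x

m = Flip()
pos = torch.ones(3)
neg = -torch.ones(3)

traced = torch.jit.trace(m, pos)  # records only the "positive" branch
scripted = torch.jit.script(m)    # keeps the if/else

print(traced(neg))    # wrong branch: tensor([-1., -1., -1.])
print(scripted(neg))  # correct: tensor([1., 1., 1.])
```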

HuggingFace Integration

For transformer models, HuggingFace provides high-level APIs for saving/loading. Models include config, weights, and tokenizer. Integrates seamlessly with PyTorch ecosystem.

```python
from transformers import AutoModel, AutoTokenizer

# Load a pretrained model
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Fine-tune (your training code)
# ...

# Save
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')

# Later: load from the local directory
model = AutoModel.from_pretrained('./my_model')
tokenizer = AutoTokenizer.from_pretrained('./my_model')
```
08 — Ecosystem

Tools & Ecosystem

PyTorch is surrounded by a rich ecosystem of tools. These handle training workflows, distributed compute, monitoring, and deployment.

Framework
Lightning
High-level training framework that abstracts training loops, distributed training, and checkpointing. Reduces boilerplate significantly.
Monitoring
Weights & Biases
Experiment tracking: log metrics, visualize training, compare runs, and share results. Industry standard for ML research.
Visualization
TensorBoard
Real-time visualization of training metrics, histograms, embeddings, and computational graphs. Built-in PyTorch integration.
Computer Vision
torchvision
Pretrained models (ResNet, EfficientNet, ViT), datasets (CIFAR-10, ImageNet), and image transforms for vision tasks.
Data Pipelines
torchdata
High-performance data loading with functional composition, supports distributed dataloading and async I/O.
Deployment
ONNX
Cross-platform model format for deployment to inference servers, mobile, edge, and other frameworks.
Deployment
TorchServe
Production model serving with batching, versioning, and A/B testing. Official PyTorch deployment tool.
Compilation
TorchScript
Compile PyTorch models to an intermediate representation that runs without Python. For production deployment.
| Component | What It Does | Key Method | Common Mistake |
| --- | --- | --- | --- |
| torch.Tensor | N-dimensional array on CPU or GPU | .cuda(), .float() | Forgetting .detach() when converting to NumPy |
| autograd | Auto-computes gradients via chain rule | .backward() | Calling backward() without zeroing grads first |
| nn.Module | Base class for all layers and models | forward(), .parameters() | Forgetting to call model.train() / model.eval() |
| DataLoader | Batches and shuffles datasets | __iter__ | Not setting num_workers, bottlenecking the GPU |
| Optimizer | Updates weights using gradients | .step(), .zero_grad() | Forgetting .zero_grad(), causing gradient accumulation |
| GradScaler | Mixed precision training stability | .scale(), .update() | Not calling scaler.update() after each step |
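The first common mistake in the table is worth seeing once: a tensor that requires grad refuses direct NumPy conversion, so detach it from the graph first.

```python
import torch

x = torch.randn(3, requires_grad=True)
# x.numpy() raises RuntimeError: detach the tensor from the graph first
arr = x.detach().cpu().numpy()
print(arr.shape)  # (3,)
```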