01 — Foundation
Tensors: The Core Data Structure
Tensors are multidimensional arrays — the atomic unit of PyTorch computation. Everything flows through tensors: inputs, weights, gradients, outputs. Understanding tensor operations is the foundation of PyTorch proficiency.
Creation and Basic Properties
Tensors can be created from Python lists, NumPy arrays, or generated directly. The three properties you must understand are shape (dimensions), dtype (data type), and device (CPU or GPU).
import torch
# Create from list
x = torch.tensor([1.0, 2.0, 3.0])
print(x.shape) # torch.Size([3])
# Create 2D tensor
y = torch.randn(32, 128) # batch_size=32, features=128
print(y.shape) # torch.Size([32, 128])
# Check properties
print(x.dtype) # torch.float32
print(y.device) # cpu
# Convert dtype
z = x.float() # ensure float32
w = y.long() # convert to int64
# Move to GPU
if torch.cuda.is_available():
x_gpu = x.cuda()
# or: x_gpu = x.to('cuda:0')
Broadcasting and Reshaping
PyTorch follows NumPy broadcasting semantics: shapes are aligned from the right, and size-1 dimensions are stretched to match. For reshaping, .view() changes shape without ever copying (and therefore requires contiguous memory), while .reshape() returns a view when it can and copies only when it must.
# Broadcasting
a = torch.ones(3, 1, 4) # shape [3, 1, 4]
b = torch.ones(1, 5, 4) # shape [1, 5, 4]
c = a + b # broadcasts to [3, 5, 4]
# View vs reshape
x = torch.arange(6) # [0,1,2,3,4,5]
y = x.view(2, 3) # view, never copies (requires contiguous input)
z = x.reshape(3, 2) # view when possible, copies only if needed
# Contiguous memory matters
z_t = z.t() # transposed view
z_c = z_t.contiguous() # convert to C-contiguous for some ops
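A quick way to see why contiguity matters: .view() refuses non-contiguous input, while .reshape() silently copies when it has to. A minimal sketch:

```python
import torch

x = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
x_t = x.t()                         # transposed view, shape [3, 2]
print(x_t.is_contiguous())          # False

try:
    x_t.view(6)                     # view() requires contiguous memory
except RuntimeError:
    print("view() failed on non-contiguous tensor")

flat = x_t.reshape(6)               # reshape() copies when it must
print(flat.tolist())                # [0, 3, 1, 4, 2, 5]
```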
Indexing and Selection
Tensors support advanced indexing just like NumPy. Basic slicing returns views that share storage with the original; fancy (integer or boolean) indexing returns copies.
# Basic indexing
x = torch.randn(10, 5)
first_row = x[0] # shape [5]
first_col = x[:, 0] # shape [10]
subset = x[2:5, 1:4] # shape [3, 3]
# Masking
mask = x > 0
positive = x[mask] # fancy indexing
# Integer-array (fancy) indexing
indices = torch.tensor([0, 2, 4])
gathered = x[indices] # select rows 0, 2, 4 (returns a copy)
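For per-element selection along a dimension (rather than whole rows), torch.gather reads through an index tensor and scatter_ writes through one. A short sketch:

```python
import torch

src = torch.tensor([[1., 2., 3.],
                    [4., 5., 6.]])

# gather: out[i][j] = src[i][index[i][j]] along dim=1
idx = torch.tensor([[2], [0]])              # column 2 of row 0, column 0 of row 1
picked = torch.gather(src, dim=1, index=idx)
print(picked)                               # tensor([[3.], [4.]])

# scatter_: the in-place inverse -- write values at those positions
out = torch.zeros_like(src)
out.scatter_(dim=1, index=idx, src=torch.tensor([[9.], [8.]]))
print(out[0, 2].item(), out[1, 0].item())   # 9.0 8.0
```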
02 — Automatic Differentiation
Autograd: Computing Gradients
Autograd is PyTorch's automatic differentiation system. It tracks operations on tensors and computes gradients via backpropagation. Every tensor has a requires_grad flag that tells PyTorch to record operations on it for the backward pass.
Gradient Tracking and Backward
When you set requires_grad=True, PyTorch builds a computation graph. Calling .backward() triggers backpropagation, computing gradients for all tensors in the graph. Gradients accumulate in the .grad attribute.
import torch
# Simple scalar backward
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad) # dy/dx = 2x + 3 = 7.0
# Vector backward (implicit reduction)
x = torch.randn(5, requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad) # grad[i] = 2 * x[i]
# Manual gradient specification
x = torch.randn(3, requires_grad=True)
y = x * 2
y.backward(torch.ones_like(y)) # specify output gradient
print(x.grad) # [2, 2, 2]
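When you want gradients without touching .grad (e.g. for gradient penalties or higher-order derivatives), torch.autograd.grad is the functional alternative to .backward():

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1

# Returns the gradients instead of accumulating them into .grad
(dx,) = torch.autograd.grad(y, x)
print(dx)       # tensor(7.) -- dy/dx = 2x + 3 at x = 2
print(x.grad)   # None: .grad was never written
```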
Gradient Accumulation and Zeroing
Gradients accumulate by default when you call .backward() multiple times on the same tensor. In training loops, you must zero gradients before each backward pass to prevent accumulation.
# Gradient accumulation
x = torch.tensor(3.0, requires_grad=True)
for i in range(3):
y = x ** 2
y.backward()
print(x.grad) # 6.0 + 6.0 + 6.0 = 18.0
# Correct pattern: zero_grad before backward
x = torch.tensor(3.0, requires_grad=True)
for i in range(3):
x.grad = None # or x.grad.zero_() (only valid once .grad exists)
y = x ** 2
y.backward()
print(x.grad) # 6.0 (correct)
no_grad Context and Inference
Use torch.no_grad() context to disable gradient tracking during inference or when you need to modify tensors without affecting gradients. This saves memory and computation.
# Inference: disable gradient tracking
model.eval()
with torch.no_grad():
logits = model(x_test)
predictions = logits.argmax(dim=-1)
# Manual gradient updates (weight decay example)
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
# detach() breaks the computation graph
y = x ** 2
z = (y.detach()) ** 2 # z won't flow gradients back to x
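That the cut really happens is easy to verify: the attached branch produces a gradient, while the detached branch carries no graph at all. A small sketch:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

w = y * 2               # stays attached to the graph
w.backward()
print(x.grad)           # tensor(12.) -- d(2x^2)/dx = 4x at x = 3

z = y.detach() * 2      # graph is cut at detach()
print(z.requires_grad)  # False: calling z.backward() would raise
```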
03 — Model Architecture
nn.Module: Building Model Layers
nn.Module is the base class for all PyTorch models. It organizes parameters, buffers, and submodules. When you define a custom model, inherit from nn.Module, initialize layers in __init__, and implement forward().
Basic Module Structure
Every module tracks its parameters and submodules automatically. When you call model(x), it invokes the forward() method. Parameters become attributes and are automatically registered for optimization.
import torch
import torch.nn as nn
class SimpleMLP(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = self.fc1(x)
x = self.relu(x)
x = self.fc2(x)
return x
model = SimpleMLP(128, 256, 10)
print(model) # shows layer structure
# Access parameters
for name, param in model.named_parameters():
print(name, param.shape)
# Forward pass
x = torch.randn(32, 128) # batch_size=32, features=128
output = model(x) # calls forward()
print(output.shape) # [32, 10]
Parameter Management and state_dict
Models store learned parameters as nn.Parameter objects. Use state_dict() to get all parameters and buffers as a dictionary — essential for checkpointing and distributed training.
# Save and load model state
torch.save(model.state_dict(), 'checkpoint.pt')
# Later: load state into a fresh model
model = SimpleMLP(128, 256, 10)
model.load_state_dict(torch.load('checkpoint.pt'))
# Freeze parameters
for param in model.fc1.parameters():
param.requires_grad = False
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable_params}")
Custom Layers and Modules
Extend nn.Module to create custom layers. Register parameters with nn.Parameter() and submodules with direct assignment in __init__. PyTorch automatically tracks them.
class GateLayer(nn.Module):
def __init__(self, in_features, out_features):
super().__init__()
self.weight = nn.Parameter(torch.randn(out_features, in_features))
self.gate_weight = nn.Parameter(torch.randn(out_features, in_features))
self.bias = nn.Parameter(torch.zeros(out_features))
def forward(self, x):
gate = torch.sigmoid(x @ self.gate_weight.t())
output = (x @ self.weight.t() + self.bias) * gate
return output
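One wrinkle worth knowing: nn.Parameter is for trainable state only. Non-trainable state that should still move with .to(device) and appear in state_dict() (running statistics, fixed masks) is registered with register_buffer. A minimal sketch (the layer itself is illustrative):

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Toy layer: learned scale, fixed shift stored as a buffer."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))       # trainable, seen by the optimizer
        self.register_buffer('shift', torch.zeros(dim))  # saved and moved, never trained

    def forward(self, x):
        return x * self.scale + self.shift

m = ScaleShift(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print(sorted(m.state_dict().keys()))               # ['scale', 'shift']
```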
04 — Data Pipeline
DataLoader: Batching and Sampling
DataLoader handles the data pipeline: sampling batches, shuffling, collating samples, and prefetching. It wraps a Dataset (which provides individual samples) and handles multiprocessing for efficient I/O.
Dataset and DataLoader Basics
Implement a custom Dataset by inheriting from torch.utils.data.Dataset. Implement __len__() to return dataset size and __getitem__() to return individual samples. Pass to DataLoader for batching.
from torch.utils.data import Dataset, DataLoader
import torch
class CustomDataset(Dataset):
def __init__(self, inputs, labels):
self.inputs = inputs
self.labels = labels
def __len__(self):
return len(self.inputs)
def __getitem__(self, idx):
return self.inputs[idx], self.labels[idx]
# Create dataset
dataset = CustomDataset(
inputs=torch.randn(1000, 128),
labels=torch.randint(0, 10, (1000,))
)
# Create DataLoader
dataloader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
num_workers=4,
pin_memory=True
)
# Iterate
for batch_inputs, batch_labels in dataloader:
print(batch_inputs.shape) # [32, 128]
print(batch_labels.shape) # [32]
break
Collate Functions and Custom Batching
The collate_fn merges samples into a batch. Default stacking works for fixed-size tensors. For variable-length data (e.g., sequences), implement custom collate functions to pad or truncate.
def pad_collate_fn(batch):
"""Pad sequences to the same length in a batch."""
inputs, labels = zip(*batch)
# Pad sequences to max length in batch
max_len = max(len(x) for x in inputs)
padded = torch.zeros(len(inputs), max_len)
for i, x in enumerate(inputs):
padded[i, :len(x)] = x
labels = torch.stack(labels)
return padded, labels
dataloader = DataLoader(
dataset,
batch_size=32,
collate_fn=pad_collate_fn,
num_workers=0 # workers > 0 also work, as long as collate_fn is picklable (a top-level function, not a lambda)
)
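For this common case, torch.nn.utils.rnn.pad_sequence implements the same padding loop in one call:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1., 2., 3.]),
        torch.tensor([4.]),
        torch.tensor([5., 6.])]

# batch_first=True gives shape [batch, max_len], matching pad_collate_fn above
padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)
print(padded.shape)   # torch.Size([3, 3])
print(padded[1])      # tensor([4., 0., 0.])
```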
Optimization: num_workers and pin_memory
num_workers > 0 spawns background worker processes that load data in parallel while the GPU trains. pin_memory=True places each batch in page-locked (pinned) CPU memory, which enables faster, asynchronous host-to-GPU transfers. Use both for maximum throughput.
# Optimal DataLoader for GPU training
train_loader = DataLoader(
train_dataset,
batch_size=128,
shuffle=True,
num_workers=4, # 4 background workers loading data
pin_memory=True, # pinned CPU memory
persistent_workers=True, # keep workers alive between epochs
prefetch_factor=2 # prefetch 2 batches per worker
)
# For small datasets or debugging, use num_workers=0
debug_loader = DataLoader(
small_dataset,
batch_size=8,
num_workers=0
)
05 — Core Training Pattern
The Training Loop: Putting It Together
The training loop is the core pattern: forward pass, compute loss, backward pass, optimizer step. Repeat over batches. Validation happens at epoch boundaries to detect overfitting.
Basic Training Loop
import torch
import torch.nn as nn
import torch.optim as optim
model = SimpleMLP(128, 256, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
num_epochs = 10
for epoch in range(num_epochs):
model.train() # enable training mode (dropout, batchnorm)
epoch_loss = 0.0
for batch_inputs, batch_labels in train_loader:
# Move to device
batch_inputs = batch_inputs.to(device)
batch_labels = batch_labels.to(device)
# Forward pass
outputs = model(batch_inputs)
loss = criterion(outputs, batch_labels)
# Backward pass
optimizer.zero_grad()
loss.backward()
# Optimizer step
optimizer.step()
epoch_loss += loss.item()
# Validation at epoch boundary
model.eval()
val_loss = 0.0
with torch.no_grad():
for batch_inputs, batch_labels in val_loader:
batch_inputs = batch_inputs.to(device)
batch_labels = batch_labels.to(device)
outputs = model(batch_inputs)
val_loss += criterion(outputs, batch_labels).item()
print(f"Epoch {epoch+1}: train_loss={epoch_loss/len(train_loader):.4f}, val_loss={val_loss/len(val_loader):.4f}")
Advanced: Gradient Accumulation and Mixed Precision
When the desired batch size doesn't fit in memory, accumulate gradients over several forward/backward passes before stepping the optimizer. Mixed-precision training runs most ops in float16 for speed while keeping weights in float32 for numerical stability.
# Gradient accumulation for larger effective batch size
accumulation_steps = 4
optimizer.zero_grad()
for batch_idx, (inputs, labels) in enumerate(train_loader):
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
loss = criterion(outputs, labels) / accumulation_steps
loss.backward()
if (batch_idx + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
# Mixed precision with autocast
from torch.amp import autocast, GradScaler
scaler = GradScaler()
for inputs, labels in train_loader:
inputs, labels = inputs.to(device), labels.to(device)
with autocast(device_type='cuda'):
outputs = model(inputs)
loss = criterion(outputs, labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
Checkpointing and Early Stopping
Save model checkpoints during training to recover from crashes or resume training. Implement early stopping to stop training when validation performance plateaus, preventing overfitting.
best_val_loss = float('inf')
patience = 3
patience_counter = 0
for epoch in range(num_epochs):
# ... training code ...
# Validation
val_loss = evaluate(model, val_loader, device)
# Checkpointing
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'val_loss': val_loss
}, 'best_checkpoint.pt')
else:
patience_counter += 1
# Early stopping
if patience_counter >= patience:
print(f"Early stopping at epoch {epoch}")
break
# Load best checkpoint
checkpoint = torch.load('best_checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
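The evaluate() called above isn't defined in this snippet; one plausible sketch, with the name and signature inferred from the call site:

```python
import torch
import torch.nn as nn

def evaluate(model, loader, device, criterion=nn.CrossEntropyLoss()):
    """Mean loss over a loader with gradients disabled. Sketch only:
    assumes classification with CrossEntropyLoss -- swap in your own criterion."""
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            total += criterion(model(inputs), labels).item()
            batches += 1
    return total / max(batches, 1)
```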
06 — Hardware Utilization
GPU Acceleration: Moving to CUDA
PyTorch runs on CPU by default. CUDA training is commonly 10–100× faster, depending on model size and how well the workload parallelizes. The pattern is simple: move tensors and models to the device, compute, and move results back only when needed. Device-agnostic code makes switching easy.
Device Management
Always write device-agnostic code using a device variable. Call .to(device) on models and tensors. Check CUDA availability and manage memory with torch.cuda.empty_cache().
import torch
# Check CUDA
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
# Device-agnostic code
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# Move model and data
model = SimpleMLP(128, 256, 10).to(device)
x = torch.randn(32, 128).to(device)
output = model(x) # Computes on GPU
# Move tensors between devices
y = output.cpu() # back to CPU
z = y.to('cuda:1') # to GPU 1
# Release cached blocks back to the driver (does not free live tensors)
torch.cuda.empty_cache()
# Get current memory usage
print(torch.cuda.memory_allocated(device))
print(torch.cuda.memory_reserved(device))
Distributed Training: DataParallel and DistributedDataParallel
For multi-GPU training, wrap the model in DistributedDataParallel (recommended) or DataParallel. DDP runs one process per GPU and synchronizes gradients with all-reduce; DataParallel is single-process and simpler to set up, but slower and no longer recommended.
# DataParallel: simple but slower synchronization
model = SimpleMLP(128, 256, 10)
if torch.cuda.device_count() > 1:
model = nn.DataParallel(model)
model.to(device)
# DistributedDataParallel: proper multi-GPU (recommended for production)
# Run with: torchrun --nproc_per_node=4 train.py
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK']) # set by torchrun, one process per GPU
torch.cuda.set_device(local_rank)
model = SimpleMLP(128, 256, 10).to(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])
# Use DistributedSampler to split data across processes
sampler = torch.utils.data.distributed.DistributedSampler(
dataset,
num_replicas=dist.get_world_size(),
rank=dist.get_rank(),
shuffle=True
)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
Memory Management
GPU memory is limited. Monitor usage with nvidia-smi or PyTorch APIs. Use gradient checkpointing to trade computation for memory (recompute activations instead of storing them).
# Gradient checkpointing: trades memory for compute
from torch.utils.checkpoint import checkpoint
class CheckpointedBlock(nn.Module):
def __init__(self):
super().__init__()
self.layer1 = nn.Linear(256, 256)
self.layer2 = nn.Linear(256, 256)
self.relu = nn.ReLU()
def forward(self, x):
# Recompute layer1 output during backward instead of storing it
return checkpoint(self._inner_forward, x, use_reentrant=False)
def _inner_forward(self, x):
x = self.layer1(x)
x = self.relu(x)
x = self.layer2(x)
return x
# Saves activation memory at the cost of recomputing the block during backward; actual savings depend on model depth
07 — Saving and Loading
Model Persistence: Checkpoints and Formats
Save models as .pt files containing state_dict (recommended for research) or full models (less portable). For production, use standardized formats like ONNX or TorchScript for deployment.
state_dict vs Full Model
Always save state_dict, not the full model object. state_dict is portable, versionable, and composable. Full model pickling binds you to your exact code structure.
# GOOD: Save only state_dict
torch.save(model.state_dict(), 'model.pt')
# Load (requires model class definition)
model = SimpleMLP(128, 256, 10)
model.load_state_dict(torch.load('model.pt'))
# BAD: Save full model (not recommended)
torch.save(model, 'model_full.pt') # Pickles entire model object
model = torch.load('model_full.pt') # Breaks with code changes
# Checkpoint: save state + optimizer + epoch for resuming
checkpoint = {
'epoch': 42,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'scheduler_state_dict': scheduler.state_dict(),
'loss': loss_value
}
torch.save(checkpoint, 'checkpoint_epoch42.pt')
# Resume from checkpoint
checkpoint = torch.load('checkpoint_epoch42.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
ONNX Export for Inference
Convert PyTorch models to ONNX (Open Neural Network Exchange) for deployment to production systems, mobile, or other frameworks. ONNX decouples the model from PyTorch.
import torch.onnx
# Export to ONNX
model.eval()
dummy_input = torch.randn(1, 128)
torch.onnx.export(
model,
dummy_input,
'model.onnx',
input_names=['input'],
output_names=['output'],
opset_version=14,
dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
# Verify with onnxruntime
import onnxruntime as ort
ort_session = ort.InferenceSession('model.onnx')
ort_output = ort_session.run(None, {'input': dummy_input.numpy()})
print(ort_output[0].shape) # [1, 10]
TorchScript for Production
TorchScript compiles PyTorch models to a serialized format that runs without Python. Use torch.jit.script or torch.jit.trace to compile models for deployment.
import torch
# JIT tracing: captures execution path, works with most code
model = SimpleMLP(128, 256, 10)
dummy_input = torch.randn(1, 128)
traced_model = torch.jit.trace(model, dummy_input)
traced_model.save('model_traced.pt')
# Load and run
loaded = torch.jit.load('model_traced.pt')
output = loaded(dummy_input)
# JIT scripting: compiles Python code to TorchScript (more flexible)
class ScriptedModel(torch.nn.Module):
def forward(self, x: torch.Tensor) -> torch.Tensor:
return x + 1.0
scripted = torch.jit.script(ScriptedModel())
scripted.save('model_scripted.pt')
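The practical difference between the two: trace records the single execution path taken by the example input, so data-dependent branches are frozen at trace time, while script compiles the control flow itself. A sketch:

```python
import torch

class FlipNegative(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:   # data-dependent branch
            return x
        return -x

pos = torch.tensor([1.0, 2.0])
neg = torch.tensor([-1.0, -2.0])

traced = torch.jit.trace(FlipNegative(), pos)  # records only the "positive" path
scripted = torch.jit.script(FlipNegative())    # compiles both branches

print(traced(neg))    # tensor([-1., -2.]) -- branch frozen at trace time
print(scripted(neg))  # tensor([ 1.,  2.]) -- branch evaluated at run time
```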
HuggingFace Integration
For transformer models, HuggingFace provides high-level APIs for saving/loading. Models include config, weights, and tokenizer. Integrates seamlessly with PyTorch ecosystem.
from transformers import AutoModel, AutoTokenizer
# Load pretrained model
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Fine-tune (your training code)
# ...
# Save
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')
# Later: load from local
model = AutoModel.from_pretrained('./my_model')
tokenizer = AutoTokenizer.from_pretrained('./my_model')