PyTorch Basics

PyTorch Tensors

Core tensor operations — the lingua franca of deep learning. Everything in PyTorch builds on multi-dimensional arrays with GPU acceleration.

N-D tensors · GPU accelerated · float32 default dtype

SECTION 01

Tensors vs NumPy

PyTorch tensors are like NumPy arrays but with two superpowers: they run on GPU and they track gradients for automatic differentiation. Switching from NumPy to PyTorch is mostly just changing API names.

| Operation       | NumPy                 | PyTorch                  |
|-----------------|-----------------------|--------------------------|
| Create array    | np.array([1,2,3])     | torch.tensor([1,2,3])    |
| Random normal   | np.random.randn(3,4)  | torch.randn(3,4)         |
| Matrix multiply | A @ B                 | A @ B (same!)            |
| Reshape         | .reshape()            | .reshape() or .view()    |
| Max value       | .max()                | .max() or .amax()        |
| Move to GPU     | N/A                   | .to("cuda") or .cuda()   |
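The two libraries also interoperate directly. A minimal sketch of the conversion functions and their copy semantics:

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)   # Zero-copy: shares memory with the NumPy array
t2 = torch.tensor(a)      # Copies the data

a[0] = 99.0
print(t[0].item())        # 99.0 — from_numpy shares the underlying buffer
print(t2[0].item())       # 1.0  — torch.tensor made an independent copy

back = t.numpy()          # CPU tensors convert back without copying
```

Note the asymmetry: `torch.from_numpy` and `.numpy()` share memory (mutations are visible on both sides), while `torch.tensor(...)` always copies.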
SECTION 02

Creation & Device Management

import torch

# Creation
x = torch.tensor([1.0, 2.0, 3.0])  # From Python list
x = torch.randn(32, 768)           # Standard normal, shape (32, 768)
x = torch.zeros(3, 4)              # All zeros
x = torch.ones(3, 4)               # All ones
x = torch.arange(0, 10, 2)         # [0, 2, 4, 6, 8]
x = torch.linspace(0, 1, 100)      # 100 evenly spaced points
x = torch.eye(4)                   # 4x4 identity matrix

# Device management
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 768).to(device)  # Move to GPU if available
x = x.cuda()                         # Shorthand for .to("cuda")
x = x.cpu()                          # Move back to CPU

# Create directly on GPU (avoids a CPU→GPU copy)
x = torch.randn(32, 768, device=device)

# Check device
print(x.device)  # "cpu" or "cuda:0"

# Memory info
print(torch.cuda.memory_allocated() / 1e6, "MB allocated")
torch.cuda.empty_cache()  # Free unused cached GPU memory
SECTION 03

Indexing & Reshaping

import torch

x = torch.arange(24).reshape(2, 3, 4)  # Shape: (2, 3, 4)

# Basic indexing (same as NumPy)
x[0]        # Shape: (3, 4) — first batch item
x[0, 1]     # Shape: (4,) — first batch, second row
x[0, 1, 2]  # Scalar: 6

# Slicing
x[:, :2, :]  # Shape: (2, 2, 4) — first 2 rows of each batch item
x[..., -1]   # Shape: (2, 3) — last element in last dim (... = all preceding dims)

# Reshaping — data is the same, just a different view
x = torch.randn(32, 512, 768)
x_flat = x.view(32, -1)     # (32, 393216) — -1 infers the size
x_flat = x.reshape(32, -1)  # Same, but also handles non-contiguous memory

# Permute dims (like numpy.transpose for N dims)
x = torch.randn(32, 768, 512)
x_t = x.permute(0, 2, 1)  # (32, 512, 768) — swap last two dims

# Squeeze/unsqueeze — add or remove size-1 dims
x = torch.randn(32, 768)
x = x.unsqueeze(1)  # (32, 1, 768) — add dim at position 1
x = x.squeeze(1)    # (32, 768) — remove size-1 dim
SECTION 04

Common Operations

import torch
import torch.nn.functional as F

# Elementwise operations
a = torch.randn(4, 4)
b = torch.randn(4, 4)
c = a + b; c = a * b; c = a / b  # Standard arithmetic
c = torch.relu(a); c = torch.sigmoid(a); c = torch.tanh(a); c = F.gelu(a)  # Activations

# Reduction operations
a.sum()                       # Scalar sum of all elements
a.sum(dim=0)                  # Sum along dim 0 → shape (4,)
a.mean(dim=-1, keepdim=True)  # Mean along last dim, keep dim → (4, 1)
a.max()                       # Max value (scalar)
a.max(dim=1)                  # Returns (values, indices) along dim 1

# Matrix operations
a @ b                               # Matrix multiply (2D)
A = torch.randn(8, 3, 5)
B = torch.randn(8, 5, 2)
torch.bmm(A, B)                     # Batched matmul: (batch, n, k) @ (batch, k, m)
torch.einsum("bij,bjk->bik", A, B)  # Einstein summation (same as bmm here)

# Concatenation and stacking
c = torch.cat([a, b], dim=0)    # Concatenate along an existing dim → (8, 4)
c = torch.stack([a, b], dim=0)  # Stack — creates a new dim → (2, 4, 4)

# Softmax
logits = torch.randn(4, 10)
probs = F.softmax(logits, dim=-1)  # Always specify dim!
SECTION 05

Memory Efficiency

import torch

# In-place operations (trailing _) — modify the tensor without allocating new memory
x = torch.randn(1000, 1000)
x.relu_()    # In-place relu
x.add_(1.0)  # In-place add
x.mul_(0.5)  # In-place multiply
# Warning: in-place ops can break autograd when the overwritten values are
# needed for backward — avoid them on tensors that require gradients

# Contiguous memory — view() requires a contiguous layout
x = torch.randn(32, 512, 768)
x_t = x.permute(0, 2, 1)  # Non-contiguous after permute
x_t = x_t.contiguous()    # Copy to a contiguous layout — now .view() works
# Or just use .reshape(), which handles both cases

# No gradient computation (saves memory during inference)
with torch.no_grad():
    output = model(input)  # No gradient tape — roughly halves peak memory

# Half precision — 2x memory savings
x_half = x.half()        # float16
x_bf16 = x.bfloat16()    # bfloat16 (better for training)
print(x.element_size())  # 4 bytes (float32) vs 2 bytes (float16)
SECTION 06

dtype & Precision

import torch

# Common dtypes
torch.float32   # Default for weights, 4 bytes
torch.float16   # Half precision, 2 bytes — good for inference
torch.bfloat16  # Brain float, 2 bytes — wider range than fp16, good for training
torch.int64     # Token IDs, labels
torch.bool      # Attention masks

# Explicit dtype at creation
x = torch.randn(100, 100, dtype=torch.float16)
labels = torch.zeros(32, dtype=torch.long)  # long = int64

# Conversion
x_fp32 = x.float()     # → float32
x_fp16 = x.half()      # → float16
x_bf16 = x.bfloat16()  # → bfloat16

# Mixed precision training (PyTorch AMP)
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated
scaler = GradScaler("cuda")
with autocast("cuda"):  # Ops run in fp16/bf16 automatically
    output = model(input)
    loss = criterion(output, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
bfloat16 vs float16: bfloat16 has the same exponent range as float32 (less overflow risk) but fewer mantissa bits (lower precision). Prefer bfloat16 for training (A100/H100 support it natively). Use float16 for inference on older GPUs.
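The range difference is easy to verify with `torch.finfo`, a quick sketch:

```python
import torch

# Compare the numeric properties of the three float dtypes
for dt in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dt)
    print(dt, "max:", info.max, "eps:", info.eps)

# float16 overflows past 65504; bfloat16 shares float32's exponent range
x = torch.tensor(70000.0)
print(x.half())      # inf — out of float16 range
print(x.bfloat16())  # finite, but rounded due to fewer mantissa bits
```

The same value that overflows to `inf` in float16 survives in bfloat16 with reduced precision, which is exactly the trade-off described above.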

Autograd and gradient computation

PyTorch's autograd engine builds a dynamic computational graph as operations are performed on tensors with requires_grad=True. When loss.backward() is called, the engine traverses this graph in reverse order (backpropagation) to compute gradients for all leaf tensors, storing each accumulated gradient in the tensor's .grad attribute. Successive backward() calls accumulate into .grad rather than overwriting it, so optimizer.zero_grad() must be called before each training step to clear the previous step's gradients and prevent incorrect parameter updates.
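A minimal sketch of this accumulation behavior, using a hand-built scalar loss rather than a real model:

```python
import torch

w = torch.ones(3, requires_grad=True)   # Leaf tensor — gradients land here
x = torch.tensor([1.0, 2.0, 3.0])

loss = (w * x).sum()
loss.backward()
print(w.grad)   # tensor([1., 2., 3.]) — d(loss)/dw = x

loss = (w * x).sum()
loss.backward()
print(w.grad)   # tensor([2., 4., 6.]) — second backward ACCUMULATED into .grad

w.grad.zero_()  # What optimizer.zero_grad() does for each parameter
```

Without the final `zero_()` (or `optimizer.zero_grad()`), the next step would update parameters with the sum of this step's and last step's gradients.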

GPU memory management patterns

Managing GPU memory correctly is one of the most practically important PyTorch skills. Tensors created with .cuda() or moved with .to("cuda") consume GPU memory until they are deleted or go out of scope. When a tensor is freed, its memory returns to PyTorch's caching allocator rather than to the CUDA driver — torch.cuda.empty_cache() releases those cached but unused blocks back to the driver. For large models, wrapping inference code in torch.no_grad() prevents autograd from storing the intermediate activations needed for gradients, roughly halving peak memory usage.
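The no_grad effect is visible even on CPU: outside the context, every layer output records a backward function (and its saved activations); inside it, no graph is built. A small sketch with a toy linear layer:

```python
import torch

layer = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

y = layer(x)
print(y.grad_fn)        # <AddmmBackward0 ...> — autograd recorded the op

with torch.no_grad():
    y = layer(x)
print(y.grad_fn)        # None — no graph, no stored activations
print(y.requires_grad)  # False
```

On GPU the same pattern is what halves peak inference memory, since the saved activations dominate memory for deep models.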

| Operation                | Memory impact              | When to use                   |
|--------------------------|----------------------------|-------------------------------|
| torch.no_grad()          | No activation storage      | Inference, validation loops   |
| model.half() / bfloat16  | 2x reduction               | Training on A100/H100         |
| gradient checkpointing   | ~sqrt-scale activation use | Very deep models, limited VRAM|
| torch.cuda.empty_cache() | Frees cached blocks        | After large allocations       |
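Of these, gradient checkpointing is the least obvious to set up. A hedged sketch on a toy stack of linear layers (the 8-layer model and segment count are illustrative, not from the source): activations inside each segment are recomputed during backward instead of being stored.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy "deep" model: 8 linear layers
model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(16, 64, requires_grad=True)

# Store activations only at 4 segment boundaries; recompute the rest in backward
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([16, 64]) — gradients still flow normally
```

The trade is extra forward compute during backward for far less stored activation memory, which is why it suits very deep models on limited VRAM.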

Memory-efficient tensor operations

PyTorch tensors maintain a view-based memory model where many operations don't allocate new memory but create alternative interpretations of the same underlying data. Operations like transpose, reshape, and narrowing can return views without copying when the data is contiguous in memory. This property enables writing efficient code: a reshape(B*T, D) on a (B, T, D) tensor for feeding into a linear layer costs nearly zero memory if the tensor is C-contiguous. Understanding when operations return views versus copies is crucial for managing CUDA memory in large-scale training. The `.is_contiguous()` method reveals whether reshaping will be a copy operation, and `.contiguous()` forces a copy when necessary. In production pipelines, this distinction becomes the difference between fitting 8 GPUs' worth of data through a single GPU or running out-of-memory errors on the same hardware.
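The view-versus-copy distinction can be checked directly by comparing storage pointers, a short sketch:

```python
import torch

x = torch.randn(2, 3, 4)
flat = x.view(6, 4)
print(flat.data_ptr() == x.data_ptr())  # True — same storage, zero-copy view

xt = x.permute(0, 2, 1)
print(xt.is_contiguous())               # False — strides no longer row-major
# xt.view(2, 12)                        # would raise RuntimeError

y = xt.reshape(2, 12)                   # reshape copies when it must
print(y.data_ptr() == x.data_ptr())     # False — new storage was allocated
```

`data_ptr()` equality is a quick way to confirm whether an operation was free or silently allocated a new buffer.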

Automatic differentiation and tensor graphs

Every PyTorch tensor carries metadata: a `.grad_fn` attribute that records the operation that created it and its backward function. When a tensor is the product of operations (matmul, addition, activation), its grad_fn points to the computation that produced it. During backpropagation, PyTorch traverses this graph in reverse, calling each function's backward method to compute gradients. This dynamic graph construction, built fresh on every forward pass, is PyTorch's defining feature versus static graph frameworks like TensorFlow 1.x. The benefit is flexibility and debuggability: control flow (if-statements, loops) within the forward pass is fully supported; the cost is the overhead of rebuilding the graph on every iteration. For practitioners, inspecting grad_fn reveals whether a tensor is part of the autograd graph and helps debug why `.backward()` fails. Detaching tensors with `.detach()` breaks the grad_fn chain, enabling scenarios like sampling from a distribution during the forward pass while computing gradients through the log-probability rather than the sample itself.
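A small sketch of grad_fn provenance and how `.detach()` cuts a branch out of the graph:

```python
import torch

a = torch.randn(3, requires_grad=True)
b = a * 2
c = b.sum()
print(a.grad_fn)  # None — leaf tensor, created by the user
print(b.grad_fn)  # <MulBackward0 ...> — produced by multiplication
print(c.grad_fn)  # <SumBackward0 ...> — produced by the reduction

d = b.detach()            # Same data as b, but no grad_fn chain
print(d.requires_grad)    # False

(c + d.sum()).backward()  # Gradient flows only through c
print(a.grad)             # tensor([2., 2., 2.]) — d's branch contributed nothing
```

If `d` had not been detached, the gradient would be `[4., 4., 4.]`; the detached branch is treated as a constant.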

Broadcasting and numerical stability

PyTorch's broadcasting rules extend NumPy's conventions to higher-dimensional tensors: dimensions of size 1 expand to match other operands, and missing leading dimensions are treated as size 1. This enables powerful element-wise operations without explicit expansion, but misunderstanding broadcasting is a common source of shape errors and numerical issues. For instance, subtracting a per-token bias (shape [T, 1]) from logits (shape [B, T, V]) silently broadcasts the bias across the batch and vocabulary dimensions — correct behavior, but easy to get backwards by accident. Broadcasting also interacts with numerical stability: a naive log-softmax overflows for large logits, and the standard log-sum-exp fix relies on broadcasting to subtract each example's maximum before exponentiating. PyTorch provides numerically stable variants like `F.log_softmax()` and `F.cross_entropy()` that handle this internally, making them safer defaults than manual implementations.
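A sketch of the log-sum-exp trick: the per-row maximum has shape (4, 1) and broadcasts across the class dimension of the (4, 10) logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 100  # Large values overflow naive exp()

# Naive log-softmax denominator: exp() overflows to inf for large logits
naive = torch.log(torch.exp(logits).sum(dim=-1))
print(naive)  # some entries are likely inf

# Stable: subtract the per-row max; (4, 1) broadcasts over (4, 10)
m = logits.max(dim=-1, keepdim=True).values
stable = logits - m - torch.log(torch.exp(logits - m).sum(dim=-1, keepdim=True))

print(torch.allclose(stable, F.log_softmax(logits, dim=-1)))  # True
```

`F.log_softmax` applies the same max-subtraction internally, which is why it is the safer default.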
