NumPy

Fast N-dimensional array operations for ML preprocessing, linear algebra, and numerical computing: the foundation of the Python ML stack.


SECTION 01

Why NumPy?

Python lists are slow for numerical computation: iterating element-by-element in Python is 100-1000× slower than vectorized C operations. NumPy wraps fast C/Fortran routines with a clean Python API.

Core idea: Avoid Python loops. Express operations as array operations. NumPy dispatches them to optimized C code. This is called vectorization.
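To make the core idea concrete, here is a minimal loop-vs-vectorized comparison (a sketch; absolute timings vary by machine, but the gap is consistently large):

```python
import time
import numpy as np

x = np.random.randn(1_000_000)

# Python loop: one interpreter round-trip per element
t0 = time.perf_counter()
total_loop = 0.0
for v in x:
    total_loop += v * v
loop_time = time.perf_counter() - t0

# Vectorized: one call, dispatched to optimized C code
t0 = time.perf_counter()
total_vec = np.sum(x * x)
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.4f}s")
```

Both compute the same sum of squares; only the dispatch mechanism differs.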
SECTION 02

Array Creation & Indexing

import numpy as np

# Creation
a = np.array([1.0, 2.0, 3.0])   # from list
b = np.zeros((3, 4))            # all zeros
c = np.ones((2, 3)) * 5         # all 5s
d = np.arange(0, 10, 2)         # [0, 2, 4, 6, 8]
e = np.linspace(0, 1, 100)      # 100 points from 0 to 1
f = np.random.randn(32, 768)    # standard normal, shape (32, 768)

# Indexing - NumPy uses 0-based indexing
arr = np.arange(12).reshape(3, 4)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11]]
arr[1, 2]       # 6: row 1, col 2
arr[:, 2]       # [2, 6, 10]: all rows, col 2
arr[1:, :2]     # [[4, 5], [8, 9]]: slicing
arr[[0, 2], :]  # rows 0 and 2: fancy indexing
arr[arr > 5]    # [6, 7, 8, 9, 10, 11]: boolean indexing
SECTION 03

Broadcasting

Broadcasting automatically expands size-1 (or missing) dimensions so two arrays' shapes match, which eliminates explicit reshaping and tiling in most operations.

import numpy as np

# Rule: align shapes from the right; size-1 dims expand automatically
# (32, 768) op (768,)  -> OK: (768,) broadcasts to (32, 768)
# (32, 1)   op (1, 64) -> OK: both broadcast to (32, 64)

# Layer normalization (simplified)
activations = np.random.randn(32, 768)           # (batch, dim)
mean = activations.mean(axis=-1, keepdims=True)  # (32, 1)
std = activations.std(axis=-1, keepdims=True)    # (32, 1)
normalized = (activations - mean) / std          # (32, 768), via broadcasting

# Pairwise distances
A = np.random.randn(100, 128)   # 100 vectors
B = np.random.randn(50, 128)    # 50 vectors
# ||a - b||² = ||a||² + ||b||² - 2a·b
dists = (
    np.sum(A**2, axis=1, keepdims=True)  # (100, 1)
    + np.sum(B**2, axis=1)               # (50,) -> broadcasts to (100, 50)
    - 2 * A @ B.T                        # (100, 50)
)                                        # shape: (100, 50), no loops!
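As a sanity check, the broadcast distance formula matches an explicit double loop on a small example (a sketch; array sizes shrunk so the loop is cheap):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 8))
B = rng.standard_normal((5, 8))

# Broadcast version of ||a - b||^2 for all pairs
dists = (
    np.sum(A**2, axis=1, keepdims=True)
    + np.sum(B**2, axis=1)
    - 2 * A @ B.T
)

# Reference: explicit loops over every pair
ref = np.empty((10, 5))
for i in range(10):
    for j in range(5):
        ref[i, j] = np.sum((A[i] - B[j])**2)

print("match:", np.allclose(dists, ref))
```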
SECTION 04

Linear Algebra Ops

import numpy as np

# Matrix operations
A = np.random.randn(4, 4)
B = np.random.randn(4, 3)
C = A @ B                      # matrix multiply: (4, 3)
D = np.linalg.inv(A)           # inverse (square, non-singular only)
vals, vecs = np.linalg.eig(A)  # eigendecomposition

# SVD - used in PCA, LoRA, compression
U, S, Vt = np.linalg.svd(A)
# A ≈ U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :] for a rank-r approximation

# Solving systems of linear equations: Ax = b
b = np.random.randn(4)
x = np.linalg.solve(A, b)      # faster and more stable than inv(A) @ b

# Norms
np.linalg.norm(A)              # Frobenius norm (default for matrices)
np.linalg.norm(A, ord=2)       # spectral norm (largest singular value)
np.linalg.norm(b)              # L2 norm of a vector
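A quick check of the rank-r approximation claim on a small random matrix (a sketch; by the Eckart-Young theorem, the spectral-norm error of the truncated SVD equals the first dropped singular value):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((6, 6))

U, S, Vt = np.linalg.svd(A)

# Full reconstruction recovers A up to float round-off
full = U @ np.diag(S) @ Vt

# Rank-r truncation: keep only the r largest singular values
r = 3
A_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
err = np.linalg.norm(A - A_r, ord=2)   # spectral-norm error
print(f"rank-{r} spectral error: {err:.4f}, next singular value: {S[r]:.4f}")
```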
SECTION 05

Common ML Preprocessing

import numpy as np

# Standard scaling (zero mean, unit variance)
def standard_scale(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8   # avoid division by zero
    return (X - mean) / std

# One-hot encoding
def one_hot(labels, num_classes):
    n = len(labels)
    out = np.zeros((n, num_classes))
    out[np.arange(n), labels] = 1
    return out

# Cosine similarity matrix (all pairs)
def cosine_matrix(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / (norms + 1e-8)
    return normed @ normed.T     # (n, n): each entry is a cosine similarity

# Softmax (numerically stable implementation)
def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)
SECTION 06

NumPy โ†’ PyTorch

NumPy and PyTorch share memory layouts, so conversion is zero-copy for CPU tensors.

import numpy as np
import torch

# NumPy -> PyTorch (shares memory - no copy!)
arr = np.random.randn(32, 768).astype(np.float32)
tensor = torch.from_numpy(arr)
# Modifying arr will change tensor and vice versa

# PyTorch -> NumPy (must be on CPU, no grad)
tensor_cpu = torch.randn(32, 768)
arr2 = tensor_cpu.numpy()                 # zero-copy
arr3 = tensor_cpu.detach().cpu().numpy()  # safe version (any device, with grad)

# dtype gotcha: NumPy defaults to float64, PyTorch expects float32
arr64 = np.random.randn(32, 768)                       # float64
tensor32 = torch.from_numpy(arr64.astype(np.float32))  # explicit cast

# DataLoader from NumPy arrays (X_np, y_np are your preprocessed arrays)
from torch.utils.data import TensorDataset, DataLoader
X = torch.from_numpy(X_np.astype(np.float32))
y = torch.from_numpy(y_np.astype(np.int64))  # use np.int64; np.long does not exist
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
Performance tip: Prefer PyTorch operations over NumPy inside training loops; they run on the GPU and support autograd. Use NumPy for preprocessing and data loading.
SECTION 07

NumPy Gotchas & Performance

import numpy as np

# Gotcha 1: View vs Copy
a = np.arange(10)
b = a[::2]  # View, not a copy
b[0] = 999
print(a[0])  # 999 - a was modified!

# Gotcha 2: Type coercion
x = np.array([1, 2, 3], dtype=np.int32)
y = x / 2  # Returns float64, not int32!

# Gotcha 3: Broadcasting surprises
a = np.ones((3, 4))
b = np.ones((4,))
c = a + b  # Broadcasting works (3,4) + (4,) = (3,4)

d = np.ones((5,))
e = a + d  # ValueError! (3,4) + (5,) incompatible

# Gotcha 4: Axis confusion
x = np.random.randn(32, 768)  # Batch size 32, features 768
mean_batch = x.mean(axis=0)  # Shape (768,) - average over the batch
mean_feats = x.mean(axis=1)  # Shape (32,) - average over features
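The fixes for these gotchas are short; a minimal sketch:

```python
import numpy as np

# Fix 1: force a copy when you need independence from the original
a = np.arange(10)
b = a[::2].copy()   # independent array, not a view
b[0] = 999
print(a[0])         # 0 - a is untouched

# Fix 2: choose the result dtype deliberately
x = np.array([1, 2, 3], dtype=np.int32)
y_int = x // 2                      # integer floor division stays int32
y_f32 = (x / 2).astype(np.float32)  # cast the float64 result down

# Fix 3/4: keepdims=True preserves the reduced axis,
# so the result broadcasts back against the original array
m = np.random.randn(32, 768)
mean = m.mean(axis=1, keepdims=True)  # shape (32, 1), not (32,)
centered = m - mean                   # broadcasts cleanly to (32, 768)
```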
SECTION 08

NumPy vs PyTorch Performance

Operation                NumPy (CPU)  PyTorch (CPU)  PyTorch (GPU)  Notes
Matrix multiply (1K×1K)  0.8 ms       0.9 ms         0.05 ms        GPU ~16× faster
Element-wise ops         0.2 ms       0.2 ms         0.01 ms        GPU overhead dominates for small ops
Batch norm (BS=32)       N/A          1.5 ms         0.1 ms         PyTorch has native support

NumPy for ML preprocessing: NumPy excels at vectorized operations on large arrays, which is exactly what preprocessing pipelines need. Standardization, one-hot encoding, and similar transforms reduce to a handful of array operations. Loading datasets, computing statistics, and transforming features can all be done in pure NumPy before converting to PyTorch tensors. For CPU-side preprocessing, NumPy offers a strong balance of simplicity and performance; for datasets larger than memory, pair it with memory mapping or a distributed layer such as Dask.

Broadcasting rules can be confusing but become second nature with practice. The rule: shapes are compatible if their dimensions match (or are size 1) when aligned from right to left, treating missing leading dimensions as size 1. Common patterns: adding a scalar to an array (the scalar broadcasts), adding a row vector to every row of a matrix (the row broadcasts), and adding a column vector to every column (which requires reshaping a flat vector to shape (n, 1) first). Mastering broadcasting eliminates explicit loops and makes code both faster and clearer.
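The row-vs-column distinction in concrete form (a minimal sketch):

```python
import numpy as np

M = np.zeros((3, 4))
row = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,)
col = np.array([10.0, 20.0, 30.0])    # shape (3,)

# Row vector: (3, 4) + (4,) broadcasts across rows directly
r = M + row

# Column vector: (3,) must become (3, 1) first
c = M + col[:, None]   # equivalently col.reshape(3, 1)

# (3, 4) + (3,) raises: shapes don't align from the right
try:
    M + col
except ValueError:
    print("ValueError: (3,4) + (3,) is incompatible, as expected")
```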

NumPy memory layout (C-contiguous vs. Fortran-contiguous) affects performance on large operations. C-contiguous (row-major) is the default and usually optimal for NumPy operations. However, transposing produces a non-contiguous view, and reshaping such a view forces a hidden copy, so these operations need care. Using numpy.ascontiguousarray() or checking array flags (arr.flags) helps debug unexpected slowdowns. For extreme performance requirements, note that numpy.linalg delegates to BLAS/LAPACK, so understanding which operations are backend-delegated helps optimize pipelines.
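Checking and repairing layout looks like this (a sketch):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # C-contiguous (row-major) by default
print(a.flags['C_CONTIGUOUS'])    # True

# A transpose is a view: no data moved, but no longer C-contiguous
t = a.T
print(t.flags['C_CONTIGUOUS'])    # False (it is F-contiguous instead)

# Force a contiguous copy when a downstream routine needs one
tc = np.ascontiguousarray(t)
print(tc.flags['C_CONTIGUOUS'])   # True: same values, new memory layout
```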

EXTRA

NumPy Ecosystem and Extensions

While NumPy is powerful, it executes one operation at a time on the CPU, which limits performance for some tasks. NumPy's C API allows libraries to integrate with NumPy arrays without copying. Packages like Numba enable JIT compilation of NumPy code to near-C speeds. CuPy provides a CUDA-accelerated NumPy-like interface for GPU computation, and its nearly identical API makes CPU-to-GPU code migration straightforward.

The NumPy ecosystem includes specialized libraries: SciPy for scientific computing, scikit-learn for machine learning preprocessing, pandas for tabular data, and xarray for labeled multidimensional arrays. Understanding when to use each tool prevents reinventing wheels. NumPy's central role in the Python data science stack makes it invaluable for understanding the broader ecosystem.

Modern data loading pipelines often use NumPy's memory mapping for large files, enabling out-of-core computation where only subsets of data fit in memory. Structured arrays (NumPy arrays with named fields) efficiently represent heterogeneous data like image metadata. Advanced indexing with boolean masks and fancy indexing make data filtering concise and performant compared to explicit loops.
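A memory-mapping sketch (the file name and temp directory here are illustrative; `mmap_mode="r"` loads only the pages actually touched):

```python
import os
import tempfile
import numpy as np

# Save an array to disk, then memory-map it back read-only
path = os.path.join(tempfile.mkdtemp(), "features.npy")  # illustrative path
data = np.arange(1000, dtype=np.float32).reshape(100, 10)
np.save(path, data)

mm = np.load(path, mmap_mode="r")   # memmap, not a full in-RAM array
print(mm.shape)                     # (100, 10)
print(mm[42, 3])                    # random access without loading the file

# Materialize just one batch in memory when needed
batch = np.asarray(mm[10:20])
```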

NumPy's relationship with compiled extensions is important for performance. Operations like matrix multiplication delegate to BLAS libraries (OpenBLAS, Intel MKL) which are highly optimized. NumPy's linalg module wraps LAPACK for linear algebra. Using these optimized routines instead of pure Python loops provides 100-1000x speedups. Knowing which operations are backend-delegated helps optimize preprocessing pipelines.

For truly large-scale datasets (terabytes on disk), Dask extends NumPy with distributed arrays, enabling out-of-core computation across multiple machines. Dask arrays follow NumPy's API, making the transition from single-machine NumPy to distributed Dask straightforward. Understanding NumPy is prerequisite knowledge for scaling to distributed data processing.

NumPy's einsum (Einstein summation) notation provides a powerful abstraction for complex tensor operations. Instead of thinking about axis orders and reshapes, you write the index pattern directly. This notation clarifies intent and often enables NumPy to optimize the computation graph. Learning einsum well unlocks elegant solutions to operations that would otherwise require multiple reshape and transpose calls.
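A few einsum patterns next to their conventional equivalents (a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))
batch = rng.standard_normal((8, 4, 5))

# Matrix multiply: contract over the shared index j
C = np.einsum('ij,jk->ik', A, B)          # same as A @ B

# Batched matmul: the batch index b is carried through
D = np.einsum('bij,jk->bik', batch, B)    # shape (8, 4, 3)

# Transpose and row sums use the same notation
AT = np.einsum('ij->ji', A)               # same as A.T
rowsum = np.einsum('ij->i', A)            # same as A.sum(axis=1)
```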

BOOST

NumPy for Numerical Computing and Scientific Applications

Beyond machine learning, NumPy is the foundation for scientific computing in Python. Physics simulations, climate modeling, molecular dynamics, and computational chemistry all rely on NumPy's efficient array operations. The mathematical rigor and performance of NumPy-based code attract scientists from diverse fields. Understanding NumPy deeply opens doors to interdisciplinary scientific computing projects that leverage machine learning techniques.

Numerical stability is critical in scientific computing. Problems like solving linear systems can be ill-conditioned: tiny perturbations in the input produce wildly different results. NumPy delegates to stable LAPACK implementations, but practitioners must still understand condition numbers and related numerical issues. Combined with matplotlib for visualization, NumPy enables reproducible scientific research.
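A small demonstration of ill-conditioning via np.linalg.cond (a sketch; the matrices are chosen to make the effect obvious):

```python
import numpy as np

# Well-conditioned system: solution insensitive to perturbations
A = np.array([[2.0, 0.0], [0.0, 1.0]])
print(np.linalg.cond(A))   # small condition number (2.0)

# Ill-conditioned: nearly parallel rows
B = np.array([[1.0, 1.0], [1.0, 1.0001]])
cond_B = np.linalg.cond(B)
print(cond_B)              # large: small input errors are amplified

b = np.array([2.0, 2.0001])
x1 = np.linalg.solve(B, b)                          # close to [1, 1]
x2 = np.linalg.solve(B, b + np.array([0.0, 1e-4]))  # jumps to ~[0, 2]
print(x1, x2)  # a 1e-4 perturbation moved the solution by order 1
```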

Performance optimization of NumPy code usually means avoiding Python loops, using broadcasting instead, leveraging BLAS through linalg operations, and keeping indexing memory-layout aware. Profiling NumPy code (with line_profiler or cProfile) identifies bottlenecks. Common patterns like sum reductions, matrix multiplications, and convolutions should delegate to optimized backends, not pure Python loops. Mastering these patterns unlocks high-performance numerical computing.