SECTION 01
Why Activations?
Without a nonlinearity between layers, a stack of N linear layers collapses to a single linear layer: W_N W_{N-1} ... W_1 x = W_combined x. Activation functions break this collapse; they are what gives neural networks the ability to approximate arbitrary continuous functions (universal approximation theorem).
Core idea: Activation functions introduce curvature. A network with activations can carve out arbitrary decision boundaries; without them, it can only learn linear separations.
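The collapse can be checked numerically. A minimal sketch with two bias-free nn.Linear layers (names are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between...
f1 = nn.Linear(8, 16, bias=False)
f2 = nn.Linear(16, 4, bias=False)

# ...are equivalent to one linear layer whose weight is the
# product of the two weight matrices.
W_combined = f2.weight @ f1.weight  # shape (4, 8)

x = torch.randn(5, 8)
stacked = f2(f1(x))
collapsed = x @ W_combined.T

assert torch.allclose(stacked, collapsed, atol=1e-5)
```

Inserting any nonlinearity between f1 and f2 breaks this equivalence.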
SECTION 02
ReLU Family
import torch
import torch.nn.functional as F
x = torch.randn(100)
# ReLU: max(0, x) — simple, fast, default for CNNs
out = F.relu(x) # 0 for x<=0, x for x>0
# Dead ReLU problem: neurons with x<0 always output 0, gradient=0 — stuck
# Fix 1: Leaky ReLU — small slope for negative values
out = F.leaky_relu(x, negative_slope=0.01)
# Fix 2: ELU — smooth negative region
out = F.elu(x, alpha=1.0)
# GELU (used in BERT/GPT): x * Phi(x), where Phi = standard normal CDF
out = F.gelu(x) # exact erf-based form by default
# BERT's tanh approximation: F.gelu(x, approximate='tanh') in PyTorch 1.12+
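To see how close the tanh formula is to the exact form, the two can be compared directly; the approximate='tanh' argument exists in PyTorch 1.12+ (a quick sketch):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 100)

exact = F.gelu(x)                            # x * Phi(x), erf-based
tanh_approx = F.gelu(x, approximate='tanh')  # the BERT/GPT tanh formula

# The two agree to roughly 1e-3 over this range
max_err = (exact - tanh_approx).abs().max()
```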
SECTION 03
GELU & SiLU
GELU (Gaussian Error Linear Unit) is the standard activation in transformers. It is smooth everywhere, behaves like ReLU for large |x|, and has better gradient flow near x ≈ 0.
import torch
import torch.nn.functional as F
import numpy as np
x = torch.linspace(-4, 4, 100)
# GELU: x * Phi(x), where Phi = standard normal CDF
# Approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
gelu_out = F.gelu(x)
# SiLU / Swish: x * sigmoid(x) — also smooth, used in MobileNet, EfficientNet
silu_out = F.silu(x) # = x * torch.sigmoid(x)
# Key properties:
# - GELU/SiLU are smooth everywhere (unlike ReLU which has a kink at 0)
# - Non-monotonic: slightly negative for small negative x
# - Better gradient flow than ReLU for deep transformers
# - ~20% slower to compute than ReLU (acceptable tradeoff)
# In nn.Module:
import torch.nn as nn
layer = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),  # Standard for BERT, GPT-2, GPT-3
    nn.Linear(3072, 768)
)
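The non-monotonic dip mentioned above is easy to verify numerically (an illustrative check, not part of the original listing):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 1000)
gelu_out = F.gelu(x)
silu_out = F.silu(x)

# Both dip slightly below zero for negative inputs (non-monotonic),
# unlike ReLU, which clamps exactly to 0.
gelu_min = gelu_out.min().item()  # around -0.17
silu_min = silu_out.min().item()  # around -0.28, near x = -1.28
```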
SECTION 04
SwiGLU in Modern LLMs
SwiGLU (Swish-Gated Linear Unit) is used in LLaMA, Mistral, and PaLM. It gates the activation with a second, learned linear projection: more expressive, but it requires three weight matrices per FFN layer instead of two.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLU(nn.Module):
    """Feed-forward layer with SwiGLU activation (as in LLaMA-2)."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        # Two parallel input projections (gate and up) instead of one
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x); SiLU is applied to the gate projection
        return self.down(F.silu(self.gate(x)) * self.up(x))
# Usage in a transformer block
d_model = 4096
d_ffn = int(2/3 * 4 * d_model) # LLaMA-style: 2/3 * 4 * d_model (the official code then rounds up to a multiple of 256)
ffn = SwiGLU(d_model, d_ffn)
# Why 2/3 * 4 * d_model? SwiGLU has 3 weight matrices where the standard FFN has 2,
# so 3 * (d_model * d_ffn_small) ≈ 2 * (d_model * d_ffn_big): same total params
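The parameter-count argument can be verified with plain arithmetic; this sketch assumes bias-free layers and skips LLaMA's rounding to a multiple of 256:

```python
d_model = 4096

# Standard GELU FFN: two matrices, hidden size 4 * d_model
gelu_hidden = 4 * d_model
gelu_params = 2 * d_model * gelu_hidden        # fc1 + fc2

# SwiGLU FFN: three matrices, hidden size shrunk by 2/3
swiglu_hidden = int(2 / 3 * 4 * d_model)
swiglu_params = 3 * d_model * swiglu_hidden    # gate + up + down

ratio = swiglu_params / gelu_params            # very close to 1.0
```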
SECTION 05
Comparison Table
| Activation | Formula | Used In | Notes |
| ReLU | max(0, x) | CNNs, older NLP | Fast; dead neuron problem |
| GELU | x·Φ(x) | BERT, GPT-2/3, ViT | Smooth; standard for transformers |
| SiLU/Swish | x·σ(x) | EfficientNet, MobileNetV3 | Similar to GELU; slightly cheaper |
| SwiGLU | silu(gate(x))·up(x) | LLaMA, Mistral, PaLM | Gated; best empirical performance |
| GeGLU | gelu(gate(x))·up(x) | T5 v1.1, Gemma | Same structure as SwiGLU with GELU |
SECTION 06
Implementation Details
import torch
import torch.nn as nn
import torch.nn.functional as F
# The "standard" transformer FFN with GELU (BERT-style):
class FFN_GELU(nn.Module):
    def __init__(self, d_model, d_ffn=None):
        super().__init__()
        d_ffn = d_ffn or 4 * d_model
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
# Rough relative cost on modern GPUs (varies with kernels and fusion):
# ReLU: ~1.0x (baseline)
# GELU: ~1.2x slower (negligible in practice)
# SwiGLU: ~1.4x slower per-op, but needs smaller d_ffn
# Gradient behavior check
x = torch.randn(1000, requires_grad=True)
F.gelu(x).sum().backward()
print(f"GELU gradient norm: {x.grad.norm():.4f}")
# Nonzero almost everywhere (no flat zero region, so no dead neurons)
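For contrast, running the same check through ReLU shows the dead-gradient region directly (an illustrative sketch):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1000, requires_grad=True)

F.relu(x).sum().backward()

# For ReLU, every non-positive input gets exactly zero gradient
n_dead = (x.grad == 0).sum().item()
frac_dead = n_dead / x.numel()  # about 0.5 for standard normal inputs
```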
Default choice: GELU for encoder-style models (BERT, ViT). SwiGLU for decoder-style LLMs (LLaMA, Mistral). For new architectures, SwiGLU is the current best practice.
Activation Function Variants
Beyond the commonly used activations, researchers have developed numerous variants targeting specific architectural needs. Functions such as Mish, the Swish family, and parametric activations offer different gradient-flow and computational trade-offs. Each activation function carries implicit biases about output distributions and gradient magnitudes, which affect convergence speed and final model performance. Understanding these trade-offs helps practitioners select the most appropriate function for their specific architecture and problem domain.
Modern architectures increasingly use multiple activation functions strategically. For example, Transformers use GLU variants in feed-forward layers while maintaining simplicity in attention mechanisms. This selective deployment of different activations can improve both training efficiency and model capacity.
| Function | Formula | Best For | Computational Cost |
| Mish | x * tanh(softplus(x)) | Vision & dense models | High |
| Swish | x * sigmoid(x) | NLP models | Medium |
| GELU | x * Φ(x) | Transformer layers | Medium-High |
| SwiGLU | silu(xW) ⊗ (xV) | LLM feed-forwards | Medium |
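Mish from the table can be written straight from its formula and checked against PyTorch's built-in, available since 1.9 (a small sketch):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4, 4, 100)
manual = mish(x)
builtin = F.mish(x)  # PyTorch's built-in version

assert torch.allclose(manual, builtin, atol=1e-6)
```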
Practical Selection Criteria
When choosing an activation function, consider several interconnected factors. Computational cost matters in deployment scenarios where inference throughput is critical. Gradient flow properties affect training stability and convergence speed, particularly important in deep networks. Empirical performance on your specific task can differ significantly from general benchmarks, making experiments essential for optimal results.
The choice of activation function interacts with other hyperparameters including learning rate, batch normalization placement, and weight initialization schemes. ReLU works well with He initialization but might underperform with other initialization strategies. Modern practice increasingly validates multiple activation choices rather than relying on conventional wisdom, as architectural changes can shift which functions perform optimally.
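The ReLU-plus-He-initialization pairing can be sanity-checked: with kaiming_normal_, ReLU outputs keep a mean square near 1, so signal scale is preserved across layers (a rough, seed-dependent sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # std = sqrt(2 / fan_in)
nn.init.zeros_(layer.bias)

x = torch.randn(4096, 512)      # unit-variance inputs
post = torch.relu(layer(x))

# He init gives pre-activations variance ~2; ReLU halves the second
# moment back to ~1, so the signal neither explodes nor vanishes
ms = post.pow(2).mean()
```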
For production systems, consider compatibility with your deployment hardware and software stack. Some activations compile more efficiently to various target platforms. Additionally, activation function choice affects the interpretability of learned representations, with some functions producing more sparse or structured activation patterns than others.
import torch
import torch.nn as nn
# Comparing activation functions in practice
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.GELU(),  # Modern choice for dense layers
    nn.Linear(256, 128),
    nn.SiLU(),  # Also called Swish
    nn.Linear(128, 10)
)
# For vision, you might prefer:
conv_model = nn.Sequential(
    nn.Conv2d(3, 64, 3),
    nn.BatchNorm2d(64),
    nn.Mish(),  # Often competitive for conv stacks
    nn.Conv2d(64, 128, 3),
    nn.BatchNorm2d(128),
    nn.Mish()
)
Related concepts
The relationship between activation functions and model capacity is an active area of research. Some theoretical work suggests that certain activation families preserve expressivity better than others across different depths. Empirical observations from training large models confirm that activation function choice impacts not just speed but also final accuracy. Teams building state-of-the-art systems regularly experiment with custom or less-common activation functions to push performance boundaries.
In distributed training scenarios, the computational cost of activation functions can become significant. When scaling to thousands of GPUs, even modest improvements in activation function efficiency compound substantially. This has led major AI labs to sometimes develop custom CUDA kernels for their preferred activation functions, ensuring maximum throughput on their hardware targets.
Activation function behavior changes subtly with mixed precision training. Some functions handle reduced precision better than others, making them more suitable for quantization and model compression workflows. This consideration becomes increasingly important as models grow larger and deployment constraints tighten.
Research into sparsity and efficient inference has shown that activation functions can be designed to produce sparser activation patterns. Functions that naturally encourage sparsity can reduce computational requirements during inference while maintaining model quality. This trend is particularly relevant for mobile and edge deployment scenarios where compute resources are limited.
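The hard-versus-soft sparsity distinction can be measured directly: ReLU emits exact zeros, while GELU leaves small nonzero values (an illustrative sketch):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(10000)

# Fraction of exactly-zero outputs: ReLU produces hard sparsity,
# GELU/SiLU produce small-but-nonzero negative values instead.
relu_sparsity = (F.relu(x) == 0).float().mean().item()  # about 0.5
gelu_sparsity = (F.gelu(x) == 0).float().mean().item()  # about 0
```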
The interplay between activation functions, layer normalization, and other architectural components creates complex system dynamics. Recent research suggests that optimal activation functions depend not just on the task but on the entire training configuration including learning rate schedules, weight decay, and normalization strategies. This context-dependent nature makes it difficult to establish universal best practices, requiring empirical validation for each new architecture.
Future directions in activation function research include learned activations, where the function itself becomes a trainable component of the network; parametric functions like PReLU already take a step in this direction. Activations co-designed with specific hardware platforms, whether GPUs, TPUs, or specialized accelerators, are another opportunity. As neural networks evolve toward sparse, mixture-of-experts, and multimodal architectures, new activation functions will likely emerge to match their computational characteristics and information flow patterns.
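PReLU, mentioned above, is the simplest learned activation: the negative slope is an ordinary trainable parameter that receives gradients through backprop:

```python
import torch
import torch.nn as nn

# PReLU: like Leaky ReLU, but the negative slope is learnable
act = nn.PReLU(init=0.25)

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
out = act(x)  # negative side scaled by the slope (0.25 at init)

# The slope gets gradients like any other weight:
# d(out)/d(slope) = sum of the negative inputs
out.sum().backward()
slope_grad = act.weight.grad
```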
The choice of activation function has implications for interpretability and understanding what patterns networks learn. Different activations produce different activation distributions across neurons, affecting which features become sparse or dense. This property is relevant when researchers attempt to understand and visualize what neural networks have learned from their training data.
Activation function design remains an active area of investigation across the deep learning community, as researchers continue to propose new functions and study their theoretical properties and practical impact on modern architectures.