Neural Networks

Activation Functions

ReLU, GELU, SiLU, SwiGLU — nonlinearities that give transformers expressive power. Without them, stacking layers is just matrix multiplication.

GELU: Transformers
SwiGLU: LLaMA (reportedly GPT-4)
ReLU: Classic


SECTION 01

Why Activations?

Without a nonlinearity between layers, a stack of N linear layers collapses to a single linear layer: W_n W_{n-1} ... W_1 x = W_combined x. Activation functions break this — they give neural networks the ability to approximate any function (universal approximation theorem).

Core idea: Activation functions introduce curvature. A network with activations can carve out arbitrary decision boundaries; without them, it can only learn linear separations.
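The collapse argument can be verified numerically; a minimal sketch (the layer sizes are arbitrary, and biases are omitted to keep the algebra to a single matrix product):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 16)

# Two stacked linear layers with NO activation in between...
w1 = nn.Linear(16, 32, bias=False)
w2 = nn.Linear(32, 8, bias=False)
stacked = w2(w1(x))

# ...are exactly one linear layer whose weight is the product W2 @ W1
combined = x @ (w2.weight @ w1.weight).T
print(torch.allclose(stacked, combined, atol=1e-5))  # True
```

Inserting any nonlinearity between `w1` and `w2` breaks this identity, which is the whole point.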
SECTION 02

ReLU Family

```python
import torch
import torch.nn.functional as F

x = torch.randn(100)

# ReLU: max(0, x), simple, fast, default for CNNs
out = F.relu(x)  # 0 for x <= 0, x for x > 0

# Dead ReLU problem: neurons that only see x < 0 always output 0
# and receive zero gradient, so they get stuck.

# Fix 1: Leaky ReLU, a small slope for negative values
out = F.leaky_relu(x, negative_slope=0.01)

# Fix 2: ELU, a smooth negative region
out = F.elu(x, alpha=1.0)

# GELU: x * Phi(x), where Phi is the standard normal CDF.
# A tanh approximation is used in BERT/GPT; PyTorch defaults to the exact form.
out = F.gelu(x)
```
SECTION 03

GELU & SiLU

GELU (Gaussian Error Linear Unit) is the standard activation in transformers. It is smooth everywhere and behaves like ReLU for large |x|, with better gradient flow near zero.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 100)

# GELU: x * Phi(x), where Phi is the standard normal CDF
# Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
gelu_out = F.gelu(x)

# SiLU / Swish: x * sigmoid(x), also smooth; used in MobileNet, EfficientNet
silu_out = F.silu(x)  # = x * torch.sigmoid(x)

# Key properties:
# - GELU/SiLU are smooth everywhere (unlike ReLU, which has a kink at 0)
# - Non-monotonic: slightly negative for small negative x
# - Better gradient flow than ReLU in deep transformers
# - Somewhat slower to compute than ReLU (an acceptable tradeoff)

# In an nn.Module:
import torch.nn as nn

layer = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),  # standard for BERT, GPT-2, GPT-3
    nn.Linear(3072, 768),
)
```
SECTION 04

SwiGLU in Modern LLMs

SwiGLU (Swish-Gated Linear Unit) is used in LLaMA, Mistral, and reportedly GPT-4. It gates a linear projection with a Swish-activated parallel projection, which is more expressive but requires three weight matrices per FFN layer instead of the usual two.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward layer with SwiGLU activation (as in LLaMA-2)."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        # Two parallel input projections instead of one
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x)
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Usage in a transformer block
d_model = 4096
d_ffn = int(2 / 3 * 4 * d_model)  # LLaMA-style: 2/3 * 4 * d_model

ffn = SwiGLU(d_model, d_ffn)

# Why 2/3 * 4 * d_model? SwiGLU needs three matrices where a standard FFN
# needs two, so d_ffn is shrunk by 2/3 to keep the parameter count equal:
# 3 * (d_model * (2/3 * 4 * d_model)) = 2 * (d_model * 4 * d_model)
```
SECTION 05

Comparison Table

| Activation | Formula | Used In | Notes |
| --- | --- | --- | --- |
| ReLU | max(0, x) | CNNs, older NLP | Fast; dead neuron problem |
| GELU | x·Φ(x) | BERT, GPT-2/3, ViT | Smooth; standard for transformers |
| SiLU/Swish | x·σ(x) | EfficientNet, MobileNetV3 | Similar to GELU; slightly faster |
| SwiGLU | silu(gate(x))·up(x) | LLaMA, Mistral, reportedly GPT-4 | Gated; strong empirical performance |
| GeGLU | gelu(gate(x))·up(x) | T5 v1.1 | Same structure as SwiGLU with GELU |
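GeGLU has no code elsewhere on this page; a minimal sketch, mirroring the gated SwiGLU structure but with GELU as the gate nonlinearity (the module name and sizes here are illustrative, not from any particular codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Gated feed-forward layer: same shape as SwiGLU, GELU instead of SiLU."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        # GeGLU: gelu(gate(x)) * up(x)
        return self.down(F.gelu(self.gate(x)) * self.up(x))

x = torch.randn(2, 64)
print(GeGLU(64, 128)(x).shape)  # torch.Size([2, 64])
```

Swapping the gate nonlinearity is the only difference between the GLU variants; the parameter-count argument (three matrices, smaller d_ffn) applies unchanged.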
SECTION 06

Implementation Details

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The "standard" transformer FFN with GELU (BERT-style):
class FFN_GELU(nn.Module):
    def __init__(self, d_model, d_ffn=None):
        super().__init__()
        d_ffn = d_ffn or 4 * d_model
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

# Rough relative speed on modern GPUs:
# ReLU:   ~1.0x (baseline)
# GELU:   ~1.2x slower (negligible in practice)
# SwiGLU: ~1.4x slower per op, but uses a smaller d_ffn

# Gradient behavior check
x = torch.randn(1000, requires_grad=True)
F.gelu(x).sum().backward()
print(f"GELU gradient norm: {x.grad.norm():.4f}")
# Nonzero almost everywhere: no dead neuron problem
```
Default choice: GELU for encoder-style models (BERT, ViT). SwiGLU for decoder-style LLMs (LLaMA, Mistral). For new architectures, SwiGLU is the current best practice.

Activation Function Variants

Beyond the commonly used activations, researchers have developed numerous variants targeting specific architectural needs. Functions like Mish, Swish variants, and Mixture of Activations provide alternative gradient flow properties and computational characteristics. Each activation function carries implicit biases about the distribution of outputs and gradient magnitudes, affecting convergence rates and final model performance. Understanding these trade-offs helps practitioners select the most appropriate function for their specific architecture and problem domain.

Modern architectures increasingly use multiple activation functions strategically. For example, Transformers use GLU variants in feed-forward layers while maintaining simplicity in attention mechanisms. This selective deployment of different activations can improve both training efficiency and model capacity.
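As a sanity check on the Mish definition mentioned above, its closed form x · tanh(softplus(x)) can be compared against PyTorch's built-in `F.mish` (available since PyTorch 1.9):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)

# Mish: x * tanh(softplus(x))
manual = x * torch.tanh(F.softplus(x))
builtin = F.mish(x)

print(torch.allclose(manual, builtin, atol=1e-6))  # True
```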

| Function | Formula | Best For | Computational Cost |
| --- | --- | --- | --- |
| Mish | x · tanh(softplus(x)) | Vision & dense models | High |
| Swish | x · sigmoid(x) | NLP models | Medium |
| GELU | x · Φ(x) | Transformer layers | Medium-High |
| SwiGLU | Swish(xW + b) ⊗ (xV + c) | LLM feed-forwards | Medium |

Practical Selection Criteria

When choosing an activation function, consider several interconnected factors. Computational cost matters in deployment scenarios where inference throughput is critical. Gradient flow properties affect training stability and convergence speed, particularly important in deep networks. Empirical performance on your specific task can differ significantly from general benchmarks, making experiments essential for optimal results.

The choice of activation function interacts with other hyperparameters including learning rate, batch normalization placement, and weight initialization schemes. ReLU works well with He initialization but might underperform with other initialization strategies. Modern practice increasingly validates multiple activation choices rather than relying on conventional wisdom, as architectural changes can shift which functions perform optimally.
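The ReLU/He-initialization pairing noted above can be checked numerically. He (Kaiming) init scales weights by sqrt(2 / fan_in), compensating for ReLU zeroing roughly half its inputs, so the activation scale stays stable through depth. A minimal sketch (the width, depth, and batch size are arbitrary choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Push data through a deep ReLU stack with He-initialized weights.
h = torch.randn(4096, 512)
for _ in range(8):
    layer = nn.Linear(512, 512, bias=False)
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    h = torch.relu(layer(h))

# RMS of activations should stay near 1.0 rather than exploding or vanishing.
rms = h.pow(2).mean().sqrt()
print(f"RMS after 8 layers: {rms:.3f}")
```

Repeating the loop with a poorly matched scheme (e.g. `std=1.0` normal init) makes the RMS blow up within a few layers, which is the mismatch the paragraph above warns about.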

For production systems, consider compatibility with your deployment hardware and software stack. Some activations compile more efficiently to various target platforms. Additionally, activation function choice affects the interpretability of learned representations, with some functions producing more sparse or structured activation patterns than others.

import torch
import torch.nn as nn

# Comparing activation functions in practice
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.GELU(),  # Modern choice for dense layers
    nn.Linear(256, 128),
    nn.SiLU(),   # Also called Swish
    nn.Linear(128, 10)
)

# For vision, you might prefer:
conv_model = nn.Sequential(
    nn.Conv2d(3, 64, 3),
    nn.Mish(),  # Often better for conv layers
    nn.BatchNorm2d(64),
    nn.Conv2d(64, 128, 3),
    nn.Mish()
)

Activation design remains an active area of investigation, as researchers continue to develop new functions and explore their theoretical properties and practical implications for modern neural network architectures.