Attention, feed-forward layers, positional encoding, and how the pieces create a language model
A Transformer is a stack of identical layers. Each layer contains two main sublayers: Multi-Head Attention and Feed-Forward Network. The input is token embeddings (vocabulary → d_model dimensional vector), and the output is a probability distribution over the vocabulary for the next token.
The following hyperparameters define model size and capacity:
| Model | d_model | Layers | Heads | d_ff | Params |
|---|---|---|---|---|---|
| GPT-2 Small | 768 | 12 | 12 | 3072 | 117M |
| Llama 2-7B | 4096 | 32 | 32 | 11008 | 7B |
| Llama 3-70B | 8192 | 80 | 64 | 28672 | 70B |
| GPT-4 (est) | 12288 | 96 | 96 | 49152 | ~1.8T (MoE) |
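The end-to-end shape flow described above (token IDs in, next-token probability distribution out) can be sketched in numpy. This is a toy with random weights and a small made-up vocabulary, not a trained model; only d_model matches GPT-2 Small:

```python
import numpy as np

# Toy shape walkthrough: random weights, toy vocab (not a real model).
vocab_size, d_model, seq_len = 1000, 768, 10   # d_model as in GPT-2 Small

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # token id -> d_model vector
unembed = rng.normal(size=(d_model, vocab_size))    # d_model -> vocab logits

token_ids = rng.integers(0, vocab_size, size=seq_len)
x = embedding[token_ids]        # (seq_len, d_model): input to the layer stack
# ... N Transformer layers (attention + FFN) would transform x here ...
logits = x @ unembed            # (seq_len, vocab_size)
last = logits[-1]               # logits at the final position
probs = np.exp(last - last.max())
probs /= probs.sum()            # probability distribution over the next token
```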
Each token attends to all other tokens in the sequence, asking "which other tokens are relevant to me?" This is the fundamental operation that gives Transformers their power and expressiveness.
For input X, three learned weight matrices project the input into Query (Q), Key (K), and Value (V) spaces:
The attention output is then softmax(QK^T / sqrt(d_k)) · V: each softmax row gives one token's attention weights, which take a weighted sum of the value vectors. The sqrt(d_k) scaling keeps the dot products from growing with head dimension, which would push the softmax into saturation.
In decoder-only models (for generation), each token can only attend to previous tokens (causal masking). The upper triangle of the attention matrix is masked to -inf before softmax, preventing the model from looking into the future.
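A minimal numpy sketch of scaled dot-product attention with the causal mask described above (single head, no learned projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with causal masking (numpy sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T) similarity matrix
    mask = np.triu(np.ones_like(scores), k=1)   # upper triangle = future tokens
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                          # weighted sum of values

rng = np.random.default_rng(0)
T, d_k = 5, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = causal_attention(Q, K, V)
# The first token can only attend to itself, so out[0] equals V[0].
```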
Run attention H times in parallel, each with different learned W_Q, W_K, W_V projections. Different heads learn different relationship types: one head might focus on syntax, another on coreference, another on positional relationships. This diversity improves representation quality.
Concatenate all head outputs and project back to d_model with W_O (output projection). This allows the model to combine diverse attention perspectives.
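The split-heads / attend / concatenate / project pattern can be sketched as follows (a single fused QKV weight matrix is assumed for brevity; real implementations vary in layout, and masking is omitted):

```python
import numpy as np

def multi_head_attention(x, W_qkv, W_o, n_heads):
    """Multi-head self-attention sketch (numpy, no masking)."""
    T, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)     # each (T, d_model)

    def split_heads(m):
        # (T, d_model) -> (n_heads, T, d_head): heads attend independently
        return m.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = w @ v                                           # (n_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)      # concatenate heads
    return out @ W_o                                      # output projection

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(T, d_model))
out = multi_head_attention(x, rng.normal(size=(d_model, 3 * d_model)),
                           rng.normal(size=(d_model, d_model)), n_heads)
```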
Modern models use Grouped Query Attention (GQA) instead of full Multi-Head Attention. GQA shares K/V heads across groups of Q heads, reducing KV cache memory by 4-8× with minimal quality loss. This is critical for long-context inference.
| Variant | K/V heads per Q head | KV cache size | Used by |
|---|---|---|---|
| MHA (Multi-Head Attention) | 1:1 ratio | 100% | GPT-2, early BERT |
| GQA (Grouped Query Attention, g=4) | 1:4 ratio | 25% | Llama 3, Mistral, Gemma |
| MQA (Multi-Query Attention) | 1:all ratio | 1/H | Falcon, early PaLM |
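The K/V-head sharing above amounts to a simple repeat at attention time, mirroring the `repeat_kv` helper found in Llama-style implementations (names and shapes here are illustrative):

```python
import numpy as np

def repeat_kv(kv, n_rep):
    """Expand (n_kv_heads, T, d_head) to (n_kv_heads * n_rep, T, d_head)
    so each K/V head serves a group of n_rep query heads."""
    return np.repeat(kv, n_rep, axis=0)

n_q_heads, n_kv_heads, T, d_head = 8, 2, 5, 4
k = np.random.default_rng(0).normal(size=(n_kv_heads, T, d_head))
k_full = repeat_kv(k, n_q_heads // n_kv_heads)  # (8, 5, 4)
# The KV cache stores only the 2 real K/V heads: 2/8 = 25% of the MHA cache.
```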
After attention, each token representation passes through a 2-layer FFN independently (no cross-token interaction). The FFN acts as a "knowledge store": research shows factual knowledge is stored in FFN weights, while attention routes information.
FFN = Linear(d_model → d_ff) → Activation → Linear(d_ff → d_model). The d_ff is typically 4× d_model, creating an expand-then-contract pattern.
Modern models (Llama, PaLM, Gemini) use SwiGLU activation instead of ReLU or GELU. SwiGLU projects the input into two parallel intermediate paths: one passes through a SiLU activation and acts as a gate, multiplied elementwise with the other before the down-projection. Empirically superior to older activations; to keep parameter count matched despite the extra matrix, SwiGLU models typically shrink d_ff to roughly 8/3 × d_model (hence Llama 2-7B's 11008 rather than 4 × 4096).
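A SwiGLU feed-forward block can be sketched in a few lines of numpy (weight names are illustrative; learned biases are typically omitted in these models):

```python
import numpy as np

def silu(x):
    """SiLU ("swish") activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN sketch: the SiLU gate path is multiplied elementwise
    with the linear "up" path, then projected back to d_model."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 16, 43     # d_ff ~ 8/3 * d_model, as in Llama-style models
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
out = swiglu_ffn(x,
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_ff, d_model)))
```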
The FFN accounts for roughly 2/3 of a model's parameters. This is where most factual knowledge lives. When fine-tuning, research shows that LoRA targeting FFN layers often outperforms targeting only attention layers for knowledge-intensive tasks.
Attention is permutation-invariant — without position information, the model has no sense of word order. Positional encoding adds position information to embeddings so the model understands sequence structure.
| Method | Extrapolation | Relative positions | Used by |
|---|---|---|---|
| Sinusoidal | Poor | No | Original Transformer |
| Learned absolute | None (hard cap) | No | GPT-2, BERT |
| RoPE | Good (with extension) | Yes | Llama, Mistral, Gemini |
| ALiBi | Good | Yes | BLOOM, MPT |
RoPE can be extended beyond the training context length with "RoPE scaling" techniques such as YaRN or dynamic NTK scaling, which stretch or interpolate the rotation frequencies. This is how the Llama 3 family's 8K training context was extended to 128K (in Llama 3.1) without major quality loss.
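A minimal RoPE sketch in numpy, assuming the common pairwise-rotation formulation (learned projections omitted): each (even, odd) feature pair of a token is rotated by a position-dependent angle, so the dot product between two rotated vectors depends only on their relative offset. Scaling techniques work by adjusting `base` or interpolating positions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding sketch: rotate each (even, odd)
    feature pair of every token by a position-dependent angle."""
    T, d = x.shape
    pos = np.arange(T)[:, None]                  # (T, 1) token positions
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair frequency
    theta = pos * freqs                          # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(1).normal(size=(7, 8))
y = rope(x)   # position 0 is rotated by angle 0, so y[0] == x[0]
```

Because RoPE is a pure rotation, it preserves each token vector's norm, unlike additive positional encodings.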
Deep networks require techniques to prevent gradient vanishing and activation explosion. Modern Transformers use residual connections and LayerNorm to maintain stable gradient flow and activation magnitudes throughout training.
Output = x + sublayer(x). Allows gradients to flow directly back to earlier layers without being multiplied by small values, preventing vanishing gradients in deep networks.
Normalize activations across features (not batch) at each sublayer. Prevents activation explosion and keeps training stable.
Modern models use Pre-LayerNorm: normalize input BEFORE sublayer. This improves training stability compared to the original post-norm design (normalize AFTER).
Root Mean Square Norm: simpler than LayerNorm (no mean centering). Used by Llama, Mistral — slightly faster and empirically equivalent.
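The two norms and the pre-LN residual pattern can be sketched together (numpy; the learnable gain/bias parameters are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm: center, then scale by std, across the feature dim."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    """RMSNorm: scale by root-mean-square only; no mean centering."""
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def pre_ln_block(x, sublayer):
    """Pre-LN residual: normalize BEFORE the sublayer, then add back."""
    return x + sublayer(rms_norm(x))

x = np.random.default_rng(0).normal(size=(3, 8))
y = pre_ln_block(x, lambda h: 2.0 * h)   # toy sublayer stand-in
```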
Different task requirements call for different architectural patterns. Understanding when to use each is essential for model selection and fine-tuning design.
| Architecture | Masking | KV cache | Use cases | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional | No | Embeddings, classification, NER | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right) | Yes | Text generation, chat, coding | GPT, Llama, Mistral, Claude |
| Encoder-decoder | Enc: bidir, Dec: causal | Yes | Translation, summarization, text-to-text | T5, FLAN-T5, mT5, Whisper |
Modern LLMs are almost exclusively decoder-only: simpler to scale, KV cache enables fast autoregressive generation, in-context learning works naturally. Encoder-only models (BERT-family) remain dominant for: embedding generation, token classification, and sentence similarity tasks.