
The Transformer Architecture

Self-attention powers every modern LLM — the architecture that changed everything

Q·K·V = attention — the core operation
encoder / decoder / both — three variants
O(n²) → Flash Attention — the efficiency story
Contents
  1. Why transformers replaced RNNs
  2. The self-attention mechanism
  3. Architecture variants
  4. Positional encoding
  5. Efficiency innovations
  6. Working code example
  7. What to explore next
  8. References
01 — Foundation

Why Transformers Replaced RNNs

Recurrent Neural Networks (RNNs, LSTMs) process sequences token-by-token, one step at a time. This sequential dependency made them slow to train (no parallelization) and poor at capturing long-range dependencies. Vanishing gradients meant information from early tokens faded by the time you reached token 512.

Transformers (Vaswani et al., 2017) solved this with a radical insight: replace sequential processing with parallel attention. Every token attends to every other token simultaneously, computing relationships in a single pass. This enables massive parallelization and removes the gradient flow bottleneck.

Key Advantages Over RNNs

Property | RNN/LSTM | Transformer
Parallelization | Sequential only | Fully parallel
Long-range dependencies | Weak (vanishing gradients) | Strong (direct connections)
Training speed | Slow (O(n) sequential steps) | Fast (O(1) sequential steps)
Memory efficiency | O(d) hidden state | O(n²) attention
Inference latency | Low | Higher (until KV cache)
💡 Key insight: Transformers are fundamentally parallel where RNNs are fundamentally sequential, so transformers scale with available compute. This explains why scaling laws favor transformers — you can train on more data in less wall-clock time.
Context: Transformers are the architecture underpinning all modern LLMs; the attention mechanism, tokenization, and embeddings are universal across models.
02 — Core Mechanism

The Self-Attention Mechanism

Self-attention computes a weighted sum of token representations, where the weights reflect relevance between pairs of tokens. The mechanism requires three transformations: Query (Q), Key (K), Value (V).

The Q, K, V Framework

For a sequence of tokens, we project each token into three spaces:

x_i = embedding of token i   (dim = d_model, e.g. 512)
Q_i = linear(x_i)  # Query (what am I looking for?)
K_i = linear(x_i)  # Key   (what am I?)
V_i = linear(x_i)  # Value (what information do I contain?)

Attention scores:  scores_ij = Q_i · K_j / sqrt(d_k)
Attention weights: attn_ij   = softmax(scores_ij) over all j
Output_i = Σ_j attn_ij * V_j   # weighted sum of values

The division by sqrt(d_k) stabilizes gradients; the softmax ensures weights sum to 1. The result is that each token becomes a weighted blend of all tokens' values, guided by query-key similarity.
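As a concrete sketch, the formulas above can be written in a few lines of NumPy (a minimal single-head implementation; shapes and the random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: [seq_len, d_k]; returns [seq_len, d_k]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a blend of all four value vectors, weighted by how well that token's query matches each key.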

Multi-Head Attention

A single attention head captures one type of relationship. Multi-head attention runs several heads in parallel (8 in the original Transformer; large modern models use many more), each learning different attention patterns (syntax, semantics, long-range, short-range). The outputs are concatenated and projected back to the model dimension.

# 8 attention heads, each with dimension 64 (512 / 8)
for h in range(8):
    Q_h, K_h, V_h = project_to_head_dim(Q, K, V, h)
    head_output_h = softmax(Q_h @ K_h.T / sqrt(64)) @ V_h
output = linear(concat(head_output_0, ..., head_output_7))

Causal vs. Non-Causal Attention

Causal (decoder-only, GPT-style): Token i can only attend to tokens 0..i (past and present), not future tokens. Enforced by setting scores to -∞ for j > i before the softmax.
Non-causal (BERT-style): All tokens attend to all tokens. Used in encoder-only models where future context is available.
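The causal masking described above can be sketched in NumPy (the all-zero score matrix is a stand-in for real Q·K scores; the sequence length is illustrative):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # stand-in for Q @ K.T / sqrt(d_k)

# mask the future: -inf above the diagonal => softmax weight 0 for j > i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)  # row i spreads weight uniformly over positions 0..i
```

With equal scores, row 0 attends only to itself ([1, 0, 0, 0]) while row 3 splits weight evenly over all four positions — future positions always receive exactly zero.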

⚠️ Attention O(n²) complexity: Attention requires computing similarity between every pair of tokens — quadratic in sequence length. This limits context windows. For n=4096 tokens, you compute 16M similarity scores.
Key takeaway: Attention is the key innovation, with multi-head, causal, and sliding-window variants. Tokenization is non-trivial; byte-pair encoding (BPE) is the standard. Attention's O(n²) complexity limits context length; newer variants trade some expressiveness for efficiency.
03 — Design Patterns

Architecture Variants: Encoder, Decoder, Encoder-Decoder

Transformers come in three flavors, each suited to different tasks.

Encoder-Only (BERT, RoBERTa)

Bidirectional attention: tokens see all other tokens. Trained with masked language modeling (15% of tokens masked, model predicts them). Excels at understanding and classification tasks (sentiment, NER, token classification). No autoregressive generation.

Input: [CLS] The quick [MASK] jumps [SEP]
Task: Predict the masked token
Output: "brown" (from contextual representations)
Good for: text classification, named entity recognition, semantic similarity

Decoder-Only (GPT-2, GPT-4, Llama)

Causal attention: each token attends only to past tokens. Trained with next-token prediction (given tokens 0..i, predict i+1). Naturally generates sequences autoregressively, one token at a time. All modern LLMs are decoder-only.

Input: "The quick brown"
Attend: [position 0] → [0], [position 1] → [0,1], [position 2] → [0,1,2]
Output: logits for the next token (predicting "fox")

Encoder-Decoder (T5, BART, Seq2Seq)

Encoder processes input bidirectionally; decoder attends causally to its own tokens and cross-attends to encoder outputs. Designed for tasks with separate input and output (translation, summarization, question answering).

Encoder input: "translate to French: Hello world"
Encoder: [bidirectional attention]
Decoder: [causal attention] + [cross-attention to encoder outputs]
Decoder output: "Bonjour le monde"

Architecture | Attention type | Tasks | Examples
Encoder-only | Bidirectional | Classification, tagging | BERT, RoBERTa
Decoder-only | Causal | Language generation | GPT-4, Llama, Falcon
Encoder-decoder | Causal + cross | Seq2seq, translation | T5, BART, Pegasus
Best practice: All modern LLMs (ChatGPT, Claude, Llama) are decoder-only. This single architecture dominates because generation is the most versatile task.
04 — Position

Positional Encoding

Transformers have no notion of sequence order — attention is permutation-invariant. To inject position information, we add positional encodings to token embeddings before feeding them to the first layer.

Absolute Positional Encodings

Sinusoidal (Vaswani, 2017): For position pos and model dimension d_model, compute sin and cos at different frequencies. Fixed rather than learned; an alternative is learned absolute embeddings (as in GPT-2), trained per position up to the context length. Still used in many models.

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
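These formulas translate directly into NumPy (a minimal sketch; the sequence length and model dimension are illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]   # [1, d_model/2]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 512)
print(pe.shape)  # (128, 512)
```

Each row is that position's encoding; it is added elementwise to the token embedding before the first layer. Low dimensions oscillate fast (fine-grained position), high dimensions slowly (coarse position).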

Relative Positional Encodings: RoPE

RoPE (Rotary Position Embeddings, Su et al., 2021) applies rotation matrices in the query-key dot product. More sample-efficient than sinusoidal encodings and extrapolates more naturally to longer sequences. Used in Llama, PaLM, and most modern models.

# Rotate Q, K by a position-dependent angle
theta_m = base^(-2m/d)
rotation_angle = pos * theta_m
Q_rot = R(pos) * Q
K_rot = R(pos) * K

Relative Positional Encodings: ALiBi

Attention with Linear Biases (ALiBi, Press et al., 2022) adds a position-dependent bias directly to attention logits before softmax. Extremely simple; extrapolates well. Used in some smaller models.

scores_ij = Q_i · K_j - |i - j| * alpha
# alpha is a fixed, head-specific slope (a geometric sequence across heads)
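A minimal sketch of the ALiBi bias matrix, using the fixed geometric slopes from the paper (the head count and sequence length here are illustrative):

```python
import numpy as np

seq_len, num_heads = 5, 4
# fixed head-specific slopes: a geometric sequence, as in the ALiBi paper
slopes = np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])

i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
bias = -np.abs(i - j)                 # distance penalty, [seq_len, seq_len]
alibi = slopes[:, None, None] * bias  # [num_heads, seq_len, seq_len]
print(alibi.shape)  # (4, 5, 5)
```

The bias is simply added to the attention logits before the softmax: nearby tokens are penalized little, distant tokens a lot, and each head applies a different steepness.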
💡 Why positional encoding matters: Without it, the transformer sees "cat sat the on" as equivalent to "the cat sat on" — position is lost. With it, the model learns that subject → verb → object order matters.
05 — Scaling

Efficiency Innovations: KV Cache, Flash Attention, and Friends

Transformers are powerful but expensive. Several key innovations dramatically reduce computation and memory.

KV Cache (Key-Value Caching)

During inference, we generate tokens one at a time. Without caching, we'd recompute K and V for all previous tokens on every generation step. The KV cache stores K and V for all past tokens and reuses them, cutting per-step attention cost from O(n²) to O(n).

# Without cache: recompute K, V for all tokens at each step
for step in range(max_tokens):
    output = attention(Q_new, K_all, V_all)  # K_all, V_all recomputed
    new_token = argmax(output)

# With cache: compute K, V once per token and append
kv_cache = []
for step in range(max_tokens):
    K_new, V_new = compute_once(token_new)
    kv_cache.append((K_new, V_new))
    K_past = concat(k for k, v in kv_cache)
    V_past = concat(v for k, v in kv_cache)
    output = attention(Q_new, K_past, V_past)

Flash Attention

Dao et al. (2022) observed that standard attention implementations read/write to slow GPU memory repeatedly. Flash Attention reorders computation to exploit fast SRAM (on-chip memory). Same result, 4–10x speedup, 2–5x less memory. Now standard in modern libraries.

Impact: Flash Attention made long-context training practical. Reduced attention cost from main bottleneck to roughly equal with feed-forward layers.

Grouped-Query Attention (GQA)

Standard multi-head attention has 8–12 separate K, V projections (one per head), ballooning KV cache size. GQA uses a small number of shared K, V heads (e.g., 2) with multiple Q heads attending to them. Reduces KV cache by 8x; minimal quality loss.
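A shape-level NumPy sketch of GQA (head counts and dimensions are illustrative; real implementations fuse this repetition into the attention kernel rather than materializing it):

```python
import numpy as np

seq, d_head = 6, 8
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, seq, d_head))
K = rng.standard_normal((n_kv_heads, seq, d_head))   # only 2 KV heads cached
V = rng.standard_normal((n_kv_heads, seq, d_head))

# broadcast each shared KV head to its group of query heads
K_rep = np.repeat(K, group, axis=0)  # [n_q_heads, seq, d_head]
V_rep = np.repeat(V, group, axis=0)
scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ V_rep
print(out.shape)  # (8, 6, 8)
```

The cache stores only the 2 KV heads instead of 8, a 4x reduction in this toy configuration, while all 8 query heads still attend independently.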

Sliding Window Attention

Full attention is O(n²); for very long contexts, we can't afford it. Sliding window attention limits each token to attending to the last W tokens (e.g., W=4096), reducing complexity to O(n*W). Used in Mistral, among others.
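The sliding-window mask can be sketched as a boolean matrix (sequence length and window size here are illustrative):

```python
import numpy as np

seq_len, window = 8, 3
i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]

# token i may attend to j in [i - window + 1, i]: causal AND within the window
allowed = (j <= i) & (j > i - window)
print(allowed.astype(int))  # 1 = may attend, at most `window` per row
```

Stacking layers recovers longer reach indirectly: information can hop window-by-window through the network even though each layer sees only W tokens.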

Technique | Impact | Trade-off
KV cache | 4–100x speedup at inference | Memory for cache storage
Flash Attention | 4–10x training speedup | Implementation complexity
Grouped-Query Attention | 8x KV cache reduction | Minimal quality loss
Sliding window | Linear complexity for long context | Loses very long-range attention
⚠️ Production reality: Modern inference uses KV cache + Flash Attention + GQA together. Even so, a 70B model at 8K context takes significant time. Token generation is inherently sequential and memory-bound.
06 — Practice

Working Code Example

Let's load a pre-trained transformer and inspect its internal representations, attention weights, and how self-attention works in practice.

from transformers import AutoTokenizer, AutoModel
import torch

# Load a decoder-only model (GPT-style)
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The transformer architecture revolutionized NLP"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

print(f"Hidden state shape: {outputs.last_hidden_state.shape}")
# Shape: [batch, sequence_length, hidden_size]

# Inspect attention weights (last layer, first head)
attn = outputs.attentions[-1][0, 0]  # [seq_len, seq_len]
print(f"Attention matrix shape: {attn.shape}")

# Check token-level representations
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

Output Breakdown

last_hidden_state: [1, seq_len, 768] — the final-layer representation for each token; 768 is GPT-2's hidden dimension.
attentions: Tuple of 12 tensors (one per layer), each [batch, num_heads, seq_len, seq_len], showing which tokens the model attends to.
tokens: ["The", "Ġtransformer", "Ġarchitecture", ...] — BPE tokenization splits words into subwords; "Ġ" marks a leading space.

Takeaway: Modern transformers are implemented in HuggingFace Transformers with a simple, consistent API across models. The attention mechanism is the core; everything else is engineering (layer norm, residuals, feed-forward, positional encoding).
07 — Deeper Dives

What to Explore Next

We've covered the architecture and efficiency. The ecosystem of transformer variants is vast. Here are four child concept pages that dive deeper:

1. Attention Mechanisms — multi-head, causal, sparse variants
   Dive into self-attention, cross-attention, different mask types, and modern variants like sparse attention and linear attention. Compare trade-offs.

2. Positional Encoding — RoPE, ALiBi, sinusoidal
   Understand how transformers encode position, why it matters, and which methods extrapolate best. Explore interpolation vs. extrapolation.

3. Transformer Architectures — encoder, decoder, encoder-decoder
   Deep dive into encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) designs. When to use each. Training objectives and fine-tuning.

4. KV Cache & Efficiency — inference optimization, sparse attention
   Master key-value caching, grouped-query attention, Flash Attention, and speculative decoding. How production systems generate tokens at scale.
💡 Learning path: Master self-attention first (this page). Then pick one child concept based on your focus: attention variants, positional methods, architecture design, or inference optimization. Each feeds into the others.
08 — Further Reading

References

Foundational Papers
Resources & Implementations

Learning Path

The transformer is the architectural foundation of all modern LLMs. Build your understanding layer by layer:

Linear Algebra (matrix multiply) → Attention (Q, K, V) → Pos. Encoding (RoPE / sinusoidal) → Full Transformer (N layers + heads) → Training (next-token prediction)
1. Implement attention in NumPy

Write softmax(Q @ K.T / sqrt(d_k)) @ V from scratch. This 5-line calculation is the entire core of the transformer. Everything else is scaffolding.

2. Read "Attention Is All You Need"

The 2017 paper is still the clearest description of the architecture. Focus on Figure 1 and Section 3. The math follows directly from your NumPy implementation.

3. Build a tiny GPT in PyTorch

Andrej Karpathy's "nanoGPT" (GitHub) is the canonical exercise. Train a character-level model on Shakespeare. You'll see all the components fit together in <300 lines.

4. Understand modern variants

RoPE (rotary position encoding) replaced sinusoidal in most modern models. GQA (grouped query attention) replaced MHA for memory efficiency. SwiGLU replaced ReLU in the FFN. These are the differences between the original paper and Llama 3.