Self-attention powers every modern LLM — the architecture that changed everything
Recurrent Neural Networks (RNNs, LSTMs) process sequences token-by-token, one step at a time. This sequential dependency made them slow to train (no parallelization) and poor at capturing long-range dependencies. Vanishing gradients meant information from early tokens faded by the time you reached token 512.
Transformers (Vaswani et al., 2017) solved this with a radical insight: replace sequential processing with parallel attention. Every token attends to every other token simultaneously, computing relationships in a single pass. This enables massive parallelization and removes the gradient flow bottleneck.
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential only | Fully parallel |
| Long-range dependencies | Weak (vanishing gradients) | Strong (direct connections) |
| Training speed | Slow (O(n) steps) | Fast (O(1) steps) |
| Memory efficiency | O(d) hidden state | O(n²) attention |
| Inference latency | Low | Higher (mitigated by KV cache) |
Self-attention computes a weighted sum of token representations, where the weights reflect relevance between pairs of tokens. The mechanism requires three transformations: Query (Q), Key (K), Value (V).
For a sequence of token embeddings X, we project each token into three spaces with learned weight matrices: Q = X @ W_q, K = X @ W_k, V = X @ W_v. Attention is then:

softmax(Q @ K.T / sqrt(d_k)) @ V

The division by sqrt(d_k) stabilizes gradients; the softmax ensures weights sum to 1. The result is that each token becomes a weighted blend of all tokens' values, guided by query-key similarity.
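The whole computation fits in a few lines of NumPy. This is a minimal sketch: the sizes are illustrative and the projection matrices are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # [n, n] pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted blend of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                              # 4 tokens, dimension 8
X = rng.normal(size=(n, d))              # stand-in token embeddings
# In a real model W_q, W_k, W_v are learned; here they are random.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one blended vector per token
```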
A single attention head captures one type of relationship. Multi-head attention runs several heads in parallel (8 in the original Transformer, 12 in GPT-2, more in larger models), each learning different attention patterns (syntax, semantics, long-range, short-range). The outputs are concatenated and projected back to the model dimension.
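The split-attend-concatenate dance can be sketched as follows; head count and dimensions here are illustrative, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # reshape [n, d_model] -> [h, n, d_k]: each head sees its own slice
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # [h, n, n]
    heads = softmax(scores) @ Vh                        # [h, n, d_k]
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o  # project back to the model dimension

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
X = rng.normal(size=(n, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *W, h=h)
print(out.shape)  # (4, 16)
```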
Causal (decoder-only, GPT-style): Token i can only attend to tokens 0..i (past and present), not future tokens. Enforced by setting scores to -∞ for j > i before softmax. Non-causal (BERT-style): All tokens attend to all tokens. Used in encoder-only models where future context is available.
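Enforcing causality is just a mask applied before the softmax, as a small sketch shows (random scores stand in for real query-key products):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))
# Mask out future positions: token i may not attend to j > i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
weights = softmax(np.where(future, -np.inf, scores), axis=-1)
print(np.round(weights, 2))
# Upper triangle is exactly 0; each row still sums to 1.
```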
Transformers come in three flavors, each suited to different tasks.
Bidirectional attention: tokens see all other tokens. Trained with masked language modeling (15% of tokens masked, model predicts them). Excels at understanding and classification tasks (sentiment, NER, token classification). No autoregressive generation.
Causal attention: each token attends only to past tokens. Trained with next-token prediction (given tokens 0..i, predict i+1). Naturally generates sequences autoregressively, one token at a time. Nearly all modern LLMs are decoder-only.
Encoder processes input bidirectionally; decoder attends causally to its own tokens and cross-attends to encoder outputs. Designed for tasks with separate input and output (translation, summarization, question answering).
| Architecture | Attention type | Tasks | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Classification, tagging | BERT, RoBERTa |
| Decoder-only | Causal | Language generation | GPT-4, Llama, Falcon |
| Encoder-decoder | Causal + cross | Seq2seq, translation | T5, BART, Pegasus |
Transformers have no notion of sequence order — attention is permutation-invariant. To inject position information, we add positional encodings to token embeddings before feeding them to the first layer.
Sinusoidal (Vaswani, 2017): For position p and model dimension d_model, compute sin and cos at geometrically spaced frequencies. Fixed rather than learned, so it is defined for any position in principle. Still used in some models.
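A sketch of the sinusoidal scheme (the 10000 base comes from the original paper; the sequence length and dimension here are arbitrary):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Fixed positional encodings: sin/cos at geometric frequencies."""
    p = np.arange(max_len)[:, None]        # positions, shape [max_len, 1]
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    freq = 1.0 / (10000 ** (i / d_model))  # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(p * freq)         # even dims get sin
    pe[:, 1::2] = np.cos(p * freq)         # odd dims get cos
    return pe

pe = sinusoidal_encoding(32, 16)
print(pe.shape)  # (32, 16)
print(pe[0])     # position 0: sin terms are 0, cos terms are 1
```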
RoPE (Rotary Position Embeddings, Su et al., 2021) applies rotation matrices in the query-key dot product. More sample-efficient than sinusoidal; naturally extrapolates to longer sequences. Used in Llama, PaLM, modern models.
Attention with Linear Biases (ALiBi, Press et al., 2022) adds a position-dependent bias directly to attention logits before softmax. Extremely simple; extrapolates well. Used in some smaller models.
Transformers are powerful but expensive. Three key innovations dramatically reduce computation and memory.
During inference, we generate tokens one at a time. Without caching, we'd recompute K and V for all previous tokens on every generation step. The KV cache stores K and V for all past tokens and reuses them, cutting attention computation from O(n²) to O(n) per generated token.
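A single-head sketch of the caching loop; random vectors stand in for real token embeddings, and the projections are random rather than learned:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))

def decode_step(x, K_cache, V_cache):
    """Attend the new token to all cached tokens plus itself."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])  # append instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    w = softmax(q @ K_cache.T / np.sqrt(d))  # one query row vs. all cached keys
    return w @ V_cache, K_cache, V_cache

for _ in range(5):                      # generate 5 tokens
    x = rng.normal(size=(1, d))         # stand-in for the new token's embedding
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached K row per generated token
```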
Dao et al. (2022) observed that standard attention implementations read/write to slow GPU memory repeatedly. Flash Attention reorders computation to exploit fast SRAM (on-chip memory). Same result, 4–10x speedup, 2–5x less memory. Now standard in modern libraries.
Standard multi-head attention keeps a separate K and V projection for every head, ballooning KV cache size. GQA uses a small number of shared K, V heads (e.g., 2) with multiple Q heads attending to each. This shrinks the KV cache by the grouping factor (8x in Llama 2 70B, which pairs 64 query heads with 8 KV heads) with minimal quality loss.
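A shape-level sketch of the grouping (head counts here are illustrative; real implementations fuse this rather than materializing the repeats):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads            # 4 query heads share each KV head

Q = rng.normal(size=(n_q_heads, n, d_k))
K = rng.normal(size=(n_kv_heads, n, d_k))  # only 2 K/V heads to cache
V = rng.normal(size=(n_kv_heads, n, d_k))

# Each query head attends to its group's shared K/V head.
K_shared = np.repeat(K, group, axis=0)     # [8, n, d_k]
V_shared = np.repeat(V, group, axis=0)
w = softmax(Q @ K_shared.transpose(0, 2, 1) / np.sqrt(d_k))
out = w @ V_shared
print(out.shape)  # (8, 4, 8): full set of query heads, but only 2 cached KV heads
```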
Full attention is O(n²); for very long contexts, we can't afford it. Sliding window attention limits each token to attending to the last W tokens (e.g., W=4096 in Mistral). This reduces complexity to O(n·W). Used in Mistral and Longformer.
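The window mask itself is a one-liner; W here is tiny so the pattern is visible:

```python
import numpy as np

n, W = 6, 3
i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
# Token i attends to j in (i - W, i]: causal, and at most W tokens back.
allowed = (j <= i) & (j > i - W)
print(allowed.astype(int))
# Each row has min(i + 1, W) ones, ending at the diagonal.
```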
| Technique | Impact | Trade-off |
|---|---|---|
| KV cache | 4–100x speedup at inference | Memory for cache storage |
| Flash Attention | 4–10x training speedup | Implementation complexity |
| Grouped-Query Attention | 8x KV cache reduction | Minimal quality loss |
| Sliding window | Linear complexity for long context | Loses very long-range attention |
Let's load a pre-trained transformer and inspect its internal representations, attention weights, and how self-attention works in practice.
- `last_hidden_state`: `[1, 10, 768]` — the final-layer representation of each token; 768 is GPT-2's hidden dimension.
- `attentions`: tuple of 12 tensors (one per layer), each `[batch, num_heads, seq_len, seq_len]`, showing which tokens the model attends to.
- `tokens`: `["The", "Ġtransformer", "Ġarchitecture", ...]` — BPE tokenization splits words into subwords; `"Ġ"` marks a preceding space.
We've covered the architecture and efficiency. The ecosystem of transformer variants is vast. Here are four child concept pages that dive deeper:
Dive into self-attention, cross-attention, different mask types, and modern variants like sparse attention and linear attention. Compare trade-offs.
→ Attention Mechanisms

Understand how transformers encode position, why it matters, and which methods extrapolate best. Explore interpolation vs. extrapolation.

→ Positional Encoding

Deep dive into encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) designs. When to use each. Training objectives and fine-tuning.

→ Transformer Architectures

Master key-value caching, grouped-query attention, Flash Attention, speculative decoding. How production systems generate tokens at scale.

→ KV Cache & Efficiency

The transformer is the architectural foundation of all modern LLMs. Build your understanding layer by layer:
Write softmax(Q @ K.T / sqrt(d_k)) @ V from scratch. This 5-line calculation is the entire core of the transformer. Everything else is scaffolding.
The 2017 paper is still the clearest description of the architecture. Focus on Figure 1 and Section 3. The math follows directly from your NumPy implementation.
Andrej Karpathy's "nanoGPT" (GitHub) is the canonical exercise. Train a character-level model on Shakespeare. You'll see all the components fit together in <300 lines.