Attention, feed-forward layers, positional encoding, and how the pieces create a language model
A Transformer is a stack of identical layers. Each layer contains two main sublayers: Multi-Head Attention and Feed-Forward Network. The input is token embeddings (vocabulary → d_model dimensional vector), and the output is a probability distribution over the vocabulary for the next token.
The following hyperparameters define model size and capacity:
| Model | d_model | Layers | Heads | d_ff | Params |
|---|---|---|---|---|---|
| GPT-2 Small | 768 | 12 | 12 | 3072 | 117M |
| Llama 2-7B | 4096 | 32 | 32 | 11008 | 7B |
| Llama 3-70B | 8192 | 80 | 64 | 28672 | 70B |
| GPT-4 (est) | 12288 | 96 | 96 | 49152 | ~1.8T (MoE) |
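The end-to-end shape flow described above (token IDs in, next-token probability distribution out) can be sketched in numpy. This is a toy with random weights and a small made-up vocabulary, not a trained model; only d_model matches GPT-2 Small:

```python
import numpy as np

# Toy shape walkthrough: random weights, toy vocab (not a real model).
vocab_size, d_model, seq_len = 1000, 768, 10   # d_model as in GPT-2 Small

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # token id -> d_model vector
unembed = rng.normal(size=(d_model, vocab_size))    # d_model -> vocab logits

token_ids = rng.integers(0, vocab_size, size=seq_len)
x = embedding[token_ids]        # (seq_len, d_model): input to the layer stack
# ... N Transformer layers (attention + FFN) would transform x here ...
logits = x @ unembed            # (seq_len, vocab_size)
last = logits[-1]               # logits at the final position
probs = np.exp(last - last.max())
probs /= probs.sum()            # probability distribution over the next token
```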
Each token attends to all other tokens in the sequence, asking "which other tokens are relevant to me?" This is the fundamental operation that gives Transformers their power and expressiveness.
For input X, three learned weight matrices project the input into Query (Q), Key (K), and Value (V) spaces:
The attention output is then softmax(QK^T / sqrt(d_k)) · V: each softmax row gives one token's attention weights, which take a weighted sum of the value vectors. The sqrt(d_k) scaling keeps the dot products from growing with head dimension, which would push the softmax into saturation.
In decoder-only models (for generation), each token can only attend to previous tokens (causal masking). The upper triangle of the attention matrix is masked to -inf before softmax, preventing the model from looking into the future.
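A minimal numpy sketch of scaled dot-product attention with the causal mask described above (single head, no learned projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with causal masking (numpy sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T) similarity matrix
    mask = np.triu(np.ones_like(scores), k=1)   # upper triangle = future tokens
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                          # weighted sum of values

rng = np.random.default_rng(0)
T, d_k = 5, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = causal_attention(Q, K, V)
# The first token can only attend to itself, so out[0] equals V[0].
```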
Run attention H times in parallel, each with different learned W_Q, W_K, W_V projections. Different heads learn different relationship types: one head might focus on syntax, another on coreference, another on positional relationships. This diversity improves representation quality.
Concatenate all head outputs and project back to d_model with W_O (output projection). This allows the model to combine diverse attention perspectives.
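The split-heads / attend / concatenate / project pattern can be sketched as follows (a single fused QKV weight matrix is assumed for brevity; real implementations vary in layout, and masking is omitted):

```python
import numpy as np

def multi_head_attention(x, W_qkv, W_o, n_heads):
    """Multi-head self-attention sketch (numpy, no masking)."""
    T, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)     # each (T, d_model)

    def split_heads(m):
        # (T, d_model) -> (n_heads, T, d_head): heads attend independently
        return m.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = w @ v                                           # (n_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)      # concatenate heads
    return out @ W_o                                      # output projection

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(T, d_model))
out = multi_head_attention(x, rng.normal(size=(d_model, 3 * d_model)),
                           rng.normal(size=(d_model, d_model)), n_heads)
```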
Modern models use Grouped Query Attention (GQA) instead of full Multi-Head Attention. GQA shares K/V heads across groups of Q heads, reducing KV cache memory by 4-8× with minimal quality loss. This is critical for long-context inference.
| Variant | K/V heads per Q head | KV cache size | Used by |
|---|---|---|---|
| MHA (Multi-Head Attention) | 1:1 ratio | 100% | GPT-2, early BERT |
| GQA (Grouped Query Attention, g=4) | 1:4 ratio | 25% | Llama 3, Mistral, Gemma |
| MQA (Multi-Query Attention) | 1:all ratio | 1/H | Falcon, early PaLM |
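The K/V-head sharing above amounts to a simple repeat at attention time, mirroring the `repeat_kv` helper found in Llama-style implementations (names and shapes here are illustrative):

```python
import numpy as np

def repeat_kv(kv, n_rep):
    """Expand (n_kv_heads, T, d_head) to (n_kv_heads * n_rep, T, d_head)
    so each K/V head serves a group of n_rep query heads."""
    return np.repeat(kv, n_rep, axis=0)

n_q_heads, n_kv_heads, T, d_head = 8, 2, 5, 4
k = np.random.default_rng(0).normal(size=(n_kv_heads, T, d_head))
k_full = repeat_kv(k, n_q_heads // n_kv_heads)  # (8, 5, 4)
# The KV cache stores only the 2 real K/V heads: 2/8 = 25% of the MHA cache.
```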
After attention, each token representation passes through a 2-layer FFN independently (no cross-token interaction). The FFN acts as a "knowledge store": research shows factual knowledge is stored in FFN weights, while attention routes information.
FFN = Linear(d_model → d_ff) → Activation → Linear(d_ff → d_model). The d_ff is typically 4× d_model, creating an expand-then-contract pattern.
Modern models (Llama, PaLM, Gemini) use SwiGLU activation instead of ReLU or GELU. SwiGLU projects the input into two parallel intermediate paths: one passes through a SiLU activation and acts as a gate, multiplied elementwise with the other before the down-projection. Empirically superior to older activations; to keep parameter count matched despite the extra matrix, SwiGLU models typically shrink d_ff to roughly 8/3 × d_model (hence Llama 2-7B's 11008 rather than 4 × 4096).
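A SwiGLU feed-forward block can be sketched in a few lines of numpy (weight names are illustrative; learned biases are typically omitted in these models):

```python
import numpy as np

def silu(x):
    """SiLU ("swish") activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN sketch: the SiLU gate path is multiplied elementwise
    with the linear "up" path, then projected back to d_model."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 16, 43     # d_ff ~ 8/3 * d_model, as in Llama-style models
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
out = swiglu_ffn(x,
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_ff, d_model)))
```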
The FFN accounts for roughly 2/3 of a model's parameters. This is where most factual knowledge lives. When fine-tuning, research shows that LoRA targeting FFN layers often outperforms targeting only attention layers for knowledge-intensive tasks.
Attention is permutation-invariant — without position information, the model has no sense of word order. Positional encoding adds position information to embeddings so the model understands sequence structure.
| Method | Extrapolation | Relative positions | Used by |
|---|---|---|---|
| Sinusoidal | Poor | No | Original Transformer |
| Learned absolute | None (hard cap) | No | GPT-2, BERT |
| RoPE | Good (with extension) | Yes | Llama, Mistral, Gemini |
| ALiBi | Good | Yes | BLOOM, MPT |
RoPE can be extended beyond the training context length with "RoPE scaling" techniques such as YaRN or dynamic NTK scaling, which stretch or interpolate the rotation frequencies. This is how the Llama 3 family's 8K training context was extended to 128K (in Llama 3.1) without major quality loss.
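A minimal RoPE sketch in numpy, assuming the common pairwise-rotation formulation (learned projections omitted): each (even, odd) feature pair of a token is rotated by a position-dependent angle, so the dot product between two rotated vectors depends only on their relative offset. Scaling techniques work by adjusting `base` or interpolating positions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding sketch: rotate each (even, odd)
    feature pair of every token by a position-dependent angle."""
    T, d = x.shape
    pos = np.arange(T)[:, None]                  # (T, 1) token positions
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair frequency
    theta = pos * freqs                          # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(1).normal(size=(7, 8))
y = rope(x)   # position 0 is rotated by angle 0, so y[0] == x[0]
```

Because RoPE is a pure rotation, it preserves each token vector's norm, unlike additive positional encodings.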
Deep networks require techniques to prevent gradient vanishing and activation explosion. Modern Transformers use residual connections and LayerNorm to maintain stable gradient flow and activation magnitudes throughout training.
Output = x + sublayer(x). Allows gradients to flow directly back to earlier layers without being multiplied by small values, preventing vanishing gradients in deep networks.
Normalize activations across features (not batch) at each sublayer. Prevents activation explosion and keeps training stable.
Modern models use Pre-LayerNorm: normalize input BEFORE sublayer. This improves training stability compared to the original post-norm design (normalize AFTER).
Root Mean Square Norm: simpler than LayerNorm (no mean centering). Used by Llama, Mistral — slightly faster and empirically equivalent.
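The two norms and the pre-LN residual pattern can be sketched together (numpy; the learnable gain/bias parameters are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm: center, then scale by std, across the feature dim."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    """RMSNorm: scale by root-mean-square only; no mean centering."""
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def pre_ln_block(x, sublayer):
    """Pre-LN residual: normalize BEFORE the sublayer, then add back."""
    return x + sublayer(rms_norm(x))

x = np.random.default_rng(0).normal(size=(3, 8))
y = pre_ln_block(x, lambda h: 2.0 * h)   # toy sublayer stand-in
```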
Different task requirements call for different architectural patterns. Understanding when to use each is essential for model selection and fine-tuning design.
| Architecture | Masking | KV cache | Use cases | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional | No | Embeddings, classification, NER | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right) | Yes | Text generation, chat, coding | GPT, Llama, Mistral, Claude |
| Encoder-decoder | Enc: bidir, Dec: causal | Yes | Translation, summarization, text-to-text | T5, FLAN-T5, mT5, Whisper |
Modern LLMs are almost exclusively decoder-only: simpler to scale, KV cache enables fast autoregressive generation, in-context learning works naturally. Encoder-only models (BERT-family) remain dominant for: embedding generation, token classification, and sentence similarity tasks.