Fine-tuning · Concept guide

LLM Alignment Techniques

RLHF, DPO, and Constitutional AI compared — how each shapes model behaviour, what it costs, and when to use it.

3 main approaches
~2× DPO speed vs RLHF
KL divergence keeps models grounded
SFT first always
Contents
  1. What alignment means
  2. SFT: the foundation
  3. RLHF workflow
  4. DPO: the simpler path
  5. Constitutional AI
  6. Method comparison
  7. References
00 — Core

How Attention Works

Attention is the mechanism that lets a transformer weigh how much each token should influence every other token. The core operation: for each token, compute a query vector, then dot it against every other token's key vector to get an attention score. Scale by √d_k so the dot products don't grow with head dimension and saturate the softmax (which would shrink its gradients), apply softmax to get a probability distribution, then use those weights to sum the value vectors. The result is a context-aware representation of each token.

Modern LLMs use Multi-Head Attention (MHA): run h independent attention heads in parallel, each learning different relationship patterns (syntactic, semantic, positional), then concatenate. Variants like Grouped Query Attention (GQA) and Multi-Query Attention (MQA) reduce the KV cache by sharing key/value heads across multiple query heads.

Python · Multi-head scaled dot-product attention from scratch in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # Single projection matrix for efficiency: Q, K, V concatenated
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj  = nn.Linear(d_model, d_model, bias=False)
        self.dropout   = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor,
                mask: torch.Tensor = None) -> torch.Tensor:
        B, T, C = x.shape  # batch, seq_len, d_model

        # Project to Q, K, V
        qkv = self.qkv_proj(x)                    # (B, T, 3*C)
        Q, K, V = qkv.split(C, dim=-1)            # each: (B, T, C)

        # Reshape for multi-head: (B, n_heads, T, d_k)
        def split_heads(t):
            return t.reshape(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split_heads(Q), split_heads(K), split_heads(V)

        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)

        # Weighted sum of values
        out = weights @ V                          # (B, n_heads, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, C) # (B, T, C)
        return self.out_proj(out)

# Test
attn = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 16, 512)   # batch=2, seq_len=16
out = attn(x)
print(f"Input: {x.shape} → Output: {out.shape}")  # same shape
00b — Variants

Attention Variants: MHA, MQA, GQA, Flash

The original Multi-Head Attention (MHA) stores a separate K and V matrix per head, which grows the KV cache linearly with the number of heads. Two efficient variants address this: Multi-Query Attention (MQA) uses a single shared K/V across all heads, drastically reducing cache size. Grouped Query Attention (GQA) — used in Llama 3, Mistral, and Gemma — is a middle ground: groups of query heads share a single K/V pair.

FlashAttention is an algorithmic optimization (not an architectural change) that rewrites the attention computation to be IO-aware — tiling the operation to fit in fast SRAM rather than repeatedly reading from HBM. FlashAttention 2 achieves ~2× the speed of standard attention on modern GPUs without changing the mathematical result. Most production LLM frameworks (vLLM, TGI) use FlashAttention by default.
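In PyTorch, a fused Flash-style kernel is available without writing custom code via `torch.nn.functional.scaled_dot_product_attention`, which dispatches to the fastest backend available for the given inputs (FlashAttention, memory-efficient attention, or the plain math path). A minimal sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Q, K, V in the (batch, heads, seq_len, head_dim) layout SDPA expects
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# PyTorch picks the kernel; the mathematical result matches naive attention.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Because FlashAttention is an IO rewrite rather than a different function, the output matches the naive softmax(QKᵀ/√d_k)V computation to numerical precision.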

Variant          KV Heads             KV Cache Size                 Quality vs MHA   Used In
MHA              = num_heads          Baseline                      Baseline         GPT-2, BERT
MQA              1 (shared)           num_heads× smaller            Slight drop      PaLM, Falcon
GQA              n_groups (e.g. 8)    num_heads/n_groups× smaller   Near-MHA         Llama 3, Mistral, Gemma 2
FlashAttention   N/A (IO rewrite)     Same as base                  Identical        All modern serving
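A minimal sketch of the GQA/MQA idea (an illustrative implementation, not any specific model's code): each K/V head is shared by a group of query heads, here by simply repeating K and V along the head dimension before the usual attention math:

```python
import math
import torch

def grouped_query_attention(q, k, v):
    """q: (B, n_heads, T, d_k); k, v: (B, n_kv_heads, T, d_k)."""
    group = q.shape[1] // k.shape[1]
    # Each K/V head serves `group` query heads: repeat along the head dim.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # 2 KV heads -> 4x smaller KV cache
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Setting the number of KV heads to 1 recovers MQA; setting it equal to the number of query heads recovers standard MHA.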
01 — Definition

What Alignment Means

A pretrained LLM predicts the next token — it's a completion engine, not an assistant. Alignment is the process of steering that completion engine toward being helpful, honest, and harmless. It happens after pretraining, starting with SFT and continuing with preference-based methods such as RLHF or DPO.

Raw pretraining teaches a model to predict plausible continuations of text. But "plausible" includes offensive, factually wrong, or harmful content if it's statistically likely given the prompt. Alignment techniques layer preferences on top of that statistical foundation — telling the model what humans actually want.

💡 Key insight: Alignment is not safety, and safety is not alignment. Alignment steers behaviour toward human preferences. Safety prevents specific harms. They're complementary but distinct.
02 — Prerequisite

SFT: The Foundation

Supervised Fine-Tuning on demonstration data always comes first. You show the model (prompt, ideal response) pairs. It's the cheapest alignment step and gives the biggest quality jump. Every downstream alignment technique builds on a well-SFT'd model.

SFT shifts the base model's entire distribution toward assistant-like outputs. It teaches format, style, instruction following, and reasoning chains. Without good SFT, RLHF or DPO training becomes noisy — you're optimizing on top of a weak foundation.

💡 Never skip SFT. RLHF or DPO applied to a raw pretrained model is significantly less effective than applied to an SFT checkpoint. Start here always.

SFT Best Practices

  • Data quality: Even 10,000 high-quality SFT examples beat 100,000 noisy ones; prioritize clarity.
  • Diversity: Cover instruction types, reasoning styles, and edge cases.
  • Iteration: SFT early and often — each refinement compounds.
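The SFT objective itself is just next-token cross-entropy restricted to the response tokens. A minimal sketch (assuming logits from any causal LM and a known prompt length; the prompt positions are masked out so only the ideal response is trained on):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Next-token cross-entropy, computed only on response tokens.

    logits: (B, T, vocab); input_ids: (B, T). Prompt tokens are masked
    so the model is trained only to reproduce the ideal response.
    """
    # Shift: the logits at position t predict token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Label prompt positions with -100 so cross_entropy ignores them
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

logits = torch.randn(2, 10, 50)           # toy batch, vocab of 50
input_ids = torch.randint(0, 50, (2, 10))
loss = sft_loss(logits, input_ids, prompt_len=4)
print(loss.item())
```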

03 — Most Complete

RLHF Workflow

RLHF (Reinforcement Learning from Human Feedback) is the alignment method behind ChatGPT, Claude, and GPT-4. It maximizes a learned reward model via PPO (Proximal Policy Optimization) while penalizing divergence from the SFT checkpoint using KL divergence.

The RLHF Pipeline

1. Collect Human Preferences — the data

Annotators rank model responses (typically A vs B). This is expensive: usually 50–100 comparisons per prompt, across thousands of prompts, with multiple annotators per comparison to ensure quality.

  • Clear preference definitions (helpfulness, factuality, safety)
  • Multiple annotators per example to measure agreement
  • Iterative calibration sessions to align annotator standards
2. Train a Reward Model — learns preferences

A separate model learns to score responses. Given (prompt, response A, response B), it predicts which response humans prefer. This model provides the reward signal during PPO training.

  • Usually a frozen base model + trainable head
  • Trained on pairwise cross-entropy loss
  • Accuracy on held-out test set signals quality
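The pairwise loss in step 2 can be sketched directly. Assuming the reward model emits one scalar per response, the standard Bradley–Terry objective pushes the chosen response's reward above the rejected one's:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss: maximize the probability that
    the chosen response scores higher than the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of 4 preference pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
r_rejected = torch.tensor([0.1, 0.5, 1.0, -1.5])
loss = reward_model_loss(r_chosen, r_rejected)
print(round(loss.item(), 4))
```

The loss approaches zero as the margin between chosen and rejected grows, and penalizes pairs where the model ranks the rejected response higher.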
3. Run PPO Loop — optimize policy

Fine-tune the LLM to maximize reward model scores while staying close to the SFT model via a KL divergence penalty. The penalty prevents reward hacking and distribution collapse.

  • Requires 3 models in VRAM: policy, reference, reward model
  • High compute cost — typically 3–4× SFT
  • Iterative refinement of generation quality
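The KL-penalized reward in step 3 can be sketched as follows. Here `beta` is a hypothetical penalty coefficient, and the per-token KL is approximated by the log-probability difference between policy and reference (a common estimator, not the only one):

```python
import torch

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """RLHF-style shaped reward: the reward model's score minus a KL
    penalty that keeps the policy close to the frozen SFT reference."""
    kl = policy_logprobs - ref_logprobs       # per-token KL estimate
    return reward - beta * kl.sum(dim=-1)     # one scalar per sequence

# Toy example: one sequence of 5 generated tokens
reward = torch.tensor([2.0])                  # reward model score
policy_lp = torch.tensor([[-1.0, -0.5, -0.8, -1.2, -0.3]])
ref_lp    = torch.tensor([[-1.1, -0.7, -0.9, -1.0, -0.6]])
shaped = kl_penalized_reward(reward, policy_lp, ref_lp, beta=0.1)
print(shaped)
```

When the policy drifts toward higher-probability-than-reference tokens, the KL term grows and eats into the reward, which is exactly what discourages reward hacking.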
4. Iterate — close the loop

Collect new preference data on the updated model, retrain the reward model, run PPO again. Each iteration refines preferences and catches reward model drift.

  • Proportional gains diminish after 3–4 iterations
  • Refresh preference data quarterly
  • Monitor for reward hacking (unwanted shortcuts)
⚠️ RLHF requires 3 models in memory simultaneously during PPO: the policy, the reference model, and the reward model. This makes it expensive — typically 3–4× the cost of SFT alone. KL divergence penalty is critical: without it, the model learns shortcuts that maximize reward artificially rather than genuinely improving quality.
04 — Modern, Simpler

DPO: The Simpler Path

Direct Preference Optimization (DPO) reformulates RLHF as a classification problem. Instead of training a reward model and running PPO, DPO directly optimizes the policy on preference pairs: given (prompt, chosen, rejected), update the policy to assign higher probability to chosen over rejected.

No reward model and no PPO loop; the frozen reference model is needed only for log-probabilities, which can even be precomputed. Roughly 2× faster to implement and 2× faster to run. Empirically, DPO matches RLHF quality on many benchmarks.
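The DPO objective itself fits in a few lines. A minimal sketch on summed sequence log-probabilities (`beta` is the usual temperature hyperparameter, set to an illustrative 0.1 here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss: widen the margin by which the policy prefers `chosen`
    over `rejected`, measured relative to the frozen reference model."""
    policy_margin = policy_chosen_lp - policy_rejected_lp
    ref_margin = ref_chosen_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy summed log-probs for a batch of 2 preference pairs
loss = dpo_loss(
    policy_chosen_lp=torch.tensor([-5.0, -6.0]),
    policy_rejected_lp=torch.tensor([-7.0, -6.5]),
    ref_chosen_lp=torch.tensor([-5.5, -6.2]),
    ref_rejected_lp=torch.tensor([-6.8, -6.4]),
)
print(round(loss.item(), 4))
```

Note there is no reward model anywhere: the sigmoid of the margin difference plays the role of the Bradley-Terry preference probability, which is the sense in which DPO's reward is implicit.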

DPO vs RLHF Pipeline

RLHF pipeline:
  SFT checkpoint → Reward model training → PPO loop (policy + ref + RM) → Final policy
  Compute cost: 3–4× SFT

DPO pipeline:
  SFT checkpoint → DPO training (preferred/rejected pairs) → Final policy
  Compute cost: ~1–1.5× SFT

DPO uses implicit reward modeling: the reward is baked into the loss function. This simplicity comes with tradeoffs: DPO may be less stable than RLHF on very large models, and without an explicit reward model there is no separate reward signal to inspect or evaluate. But for teams without massive annotation budgets or GPU fleets, DPO is often the pragmatic choice.

💡 When to use DPO: You have preference data, limited compute, and want to align quickly. Quality is ~90–95% of RLHF with significantly lower complexity.
05 — Scalable

Constitutional AI

Constitutional AI (CAI) is Anthropic's approach to replacing human preference labels with AI-generated feedback. A set of principles — the "constitution" — guides a capable model to critique and revise its own outputs. The revised outputs become training data for alignment.

This scales without human annotation. Instead of paying annotators to rank responses, you craft a constitution and let the model self-improve. But it requires a capable enough base model to self-critique reliably — weak models will generate poor feedback.

The CAI Process

  1. Constitution: Write explicit principles (e.g., "Be helpful. Be honest. Minimize harm.")
  2. Critique: Ask a strong model to critique its own outputs using the constitution
  3. Revision: The model revises outputs to address critiques
  4. Finetune: Train on revised (better) outputs using SFT
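The four steps above can be sketched as a loop. This is a hypothetical skeleton: `generate` is a stand-in for a call to any capable instruction-following model, not a real API:

```python
# Hypothetical sketch of the CAI critique-and-revise loop.
def generate(prompt: str) -> str:
    # Placeholder: in practice this calls an LLM API.
    return f"<model response to: {prompt!r}>"

CONSTITUTION = [
    "Be helpful.",
    "Be honest.",
    "Minimize harm.",
]

def cai_revise(prompt: str) -> dict:
    draft = generate(prompt)
    critique = generate(
        f"Critique this response against the principles {CONSTITUTION}:\n{draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pairs become SFT training data
    return {"prompt": prompt, "chosen": revision}

example = cai_revise("Explain how vaccines work.")
print(sorted(example.keys()))  # ['chosen', 'prompt']
```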

This trades human effort for LLM compute. The constitution must be well-written and aligned with your values — vague principles lead to vague feedback. And the critique model must be strong enough to notice flaws and suggest improvements.

⚠️ CAI works best for stylistic and behavioral alignment. For factual correctness or domain-specific knowledge, human feedback is still necessary. You can't critique what you don't know.
06 — Tradeoffs

Method Comparison

Method              Human labels needed     Compute cost   Stability   Best when
SFT                 Demonstrations          Low            High        Always — prerequisite
RLHF                Preference pairs + RM   High           Medium      Maximum quality, budget available
DPO                 Preference pairs only   Medium         High        Simpler RLHF alternative
Constitutional AI   Minimal                 Medium         Medium      Scaling without labellers

Decision Framework

🎯 Goal: Maximum Quality

  • Use RLHF if budget allows
  • Invest in diverse preference data
  • Run 3–4 iterations

⚡ Goal: Fast Iteration

  • Start with DPO
  • Requires preference data (existing or synthetic)
  • Faster feedback loop

💰 Goal: Minimize Labelling

  • Constitutional AI if model is strong
  • Write clear constitution
  • Use synthetic preferences for DPO

🔬 Goal: Research

  • Start with DPO (simpler, reproducible)
  • Build RLHF as baseline for comparison
  • Ablate reward model components

Cost vs Quality Tradeoff

RLHF achieves the highest quality but at high cost. DPO gets 90–95% of RLHF quality at half the compute. Constitutional AI sacrifices quality for annotation savings. Most teams should start with SFT + DPO, only moving to RLHF if quality plateaus and budget is available.

07 — Further Reading

References

Academic Papers
Documentation & Guides
Practitioner Writing