Fine-tuning · Concept guide

LLM Alignment Techniques

RLHF, DPO, and Constitutional AI compared — how each shapes model behaviour, what it costs, and when to use it.

3 main approaches
~2× DPO speed vs RLHF
KL divergence keeps models grounded
SFT first always
Contents
  1. What alignment means
  2. SFT: the foundation
  3. RLHF workflow
  4. DPO: the simpler path
  5. Constitutional AI
  6. Method comparison
  7. Practical alignment
  8. References
01 — Definition

What Alignment Means

A pretrained LLM predicts the next token — it's a completion engine, not an assistant. Alignment is the process of steering that completion engine toward being helpful, honest, and harmless. It happens after pretraining, beginning with SFT and continuing with preference-based methods such as RLHF and DPO.

Raw pretraining teaches a model to predict plausible continuations of text. But "plausible" includes offensive, factually wrong, or harmful content if it's statistically likely given the prompt. Alignment techniques layer preferences on top of that statistical foundation — telling the model what humans actually want.

💡 Key insight: Alignment is not safety, and safety is not alignment. Alignment steers behaviour toward human preferences. Safety prevents specific harms. They're complementary but distinct.
02 — Prerequisite

SFT: The Foundation

Supervised Fine-Tuning on demonstration data always comes first. You show the model (prompt, ideal response) pairs. It's the cheapest alignment step and gives the biggest quality jump. Every downstream alignment technique builds on a well-SFT'd model.

SFT shifts the base model's entire distribution toward assistant-like outputs. It teaches format, style, instruction following, and reasoning chains. Without good SFT, RLHF or DPO training becomes noisy — you're optimizing on top of a weak foundation.

💡 Never skip SFT. RLHF or DPO applied to a raw pretrained model is significantly less effective than applied to an SFT checkpoint. Start here always.

SFT Best Practices

  • Data quality: Even 10,000 high-quality SFT examples beat 100,000 noisy ones. Focus on clarity.
  • Diversity: Cover instruction types, reasoning styles, and edge cases.
  • Iteration: SFT early and often — each refinement compounds.
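To make the (prompt, ideal response) format concrete, here is a minimal sketch of turning demonstration pairs into training strings. The tag template and helper name are ours, purely illustrative: real pipelines normally use the tokenizer's built-in chat template.

```python
def format_sft_example(prompt: str, response: str) -> str:
    """Render one (prompt, ideal response) pair as a single training string.

    The <|user|>/<|assistant|> tags are a hypothetical template; in practice
    you would apply your tokenizer's chat template instead.
    """
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}"

demos = [
    ("What is overfitting?", "Overfitting is when a model memorizes training data..."),
    ("Define a tensor.", "A tensor is a multi-dimensional array of numbers..."),
]
sft_texts = [format_sft_example(p, r) for p, r in demos]
```

Each rendered string is then tokenized and trained on with the standard next-token loss, usually masking the prompt tokens so only the response contributes.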

03 — Most Complete

RLHF Workflow

RLHF (Reinforcement Learning from Human Feedback) is the alignment method behind ChatGPT, Claude, and GPT-4. It optimizes the policy with PPO (Proximal Policy Optimization) to maximize scores from a learned reward model, while a KL divergence penalty keeps the policy close to the SFT checkpoint.

The RLHF Pipeline

1

Collect Human Preferences — the data

Annotators rank model responses (typically A vs B). This is expensive — preference datasets usually span thousands of prompts with several sampled responses per prompt, and each comparison is labelled by multiple annotators to ensure quality.

  • Clear preference definitions (helpfulness, factuality, safety)
  • Multiple annotators per example to measure agreement
  • Iterative calibration sessions to align annotator standards
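One way to operationalize the agreement bullet above is a raw inter-annotator agreement rate over duplicated comparisons. A minimal sketch (function name is ours):

```python
from collections import Counter

def agreement_rate(labels_per_example: list[list[str]]) -> float:
    """Fraction of annotations matching the majority label, averaged over examples.

    Each inner list holds the "A"/"B" choices from several annotators
    labelling the same comparison.
    """
    rates = []
    for labels in labels_per_example:
        majority_count = Counter(labels).most_common(1)[0][1]
        rates.append(majority_count / len(labels))
    return sum(rates) / len(rates)

# Three annotators on two comparisons: unanimous, then 2-of-3.
print(agreement_rate([["A", "A", "A"], ["A", "B", "A"]]))  # ≈ 0.833
```

Low agreement usually signals unclear preference definitions and is a cue to run another calibration session before training a reward model on the labels.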
2

Train a Reward Model — learns preferences

A separate model learns to score responses. Given (prompt, response A, response B), it predicts which humans prefer. This model becomes the ground truth during PPO training.

  • Usually a frozen base model + trainable head
  • Trained on pairwise cross-entropy loss
  • Accuracy on held-out test set signals quality
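The pairwise cross-entropy loss from the bullets above is the Bradley-Terry objective: push the reward model's score for the chosen response above the rejected one's. A minimal pure-Python sketch (function name is ours):

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(score_chosen - score_rejected).

    Small when the reward model ranks the chosen response above the rejected
    one by a wide margin; large when the ranking is inverted.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_reward_loss(2.0, 0.0))  # small loss: correct ranking
print(pairwise_reward_loss(0.0, 2.0))  # large loss: inverted ranking
```

In a real trainer this is computed from the trainable head's logits for a batch of (chosen, rejected) pairs and backpropagated through the head.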
3

Run PPO Loop — optimize policy

Fine-tune the LLM to maximize reward model scores while staying close to the SFT model via KL divergence penalty. The penalty prevents reward hacking and distribution collapse.

  • Requires 3 models in VRAM: policy, reference, reward model
  • High compute cost — typically 3–4× SFT
  • Iterative refinement of generation quality
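The KL penalty from step 3 can be sketched per sequence: the reward model's score is reduced by β times a token-level estimate of the policy's divergence from the reference model. This is a standard simplification, not TRL's exact implementation:

```python
def kl_shaped_reward(rm_score: float,
                     policy_logprobs: list[float],
                     ref_logprobs: list[float],
                     beta: float = 0.1) -> float:
    """Shaped reward used in the PPO loop: rm_score - beta * KL_estimate.

    The KL estimate is the sum over generated tokens of
    log p_policy(token) - log p_ref(token); it grows as the policy
    drifts away from the SFT reference.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# A policy that assigns its tokens higher log-probs than the reference
# (i.e. has drifted) gets its reward reduced.
print(kl_shaped_reward(1.0, [-0.5, -0.4], [-1.0, -0.9], beta=0.1))  # ≈ 0.9
```

With β = 0, nothing stops the policy from collapsing onto whatever the reward model happens to score highly, which is exactly the reward-hacking failure the warning below describes.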
4

Iterate — close the loop

Collect new preference data on the updated model, retrain the reward model, run PPO again. Each iteration refines preferences and catches reward model drift.

  • Gains diminish after 3–4 iterations
  • Refresh preference data quarterly
  • Monitor for reward hacking (unwanted shortcuts)
⚠️ RLHF requires 3 models in memory simultaneously during PPO: the policy, the reference model, and the reward model. This makes it expensive — typically 3–4× the cost of SFT alone. KL divergence penalty is critical: without it, the model learns shortcuts that maximize reward artificially rather than genuinely improving quality.
04 — Simpler Alternative

DPO: The Simpler Path

Direct Preference Optimization (DPO) reformulates RLHF as a classification problem. Instead of training a reward model and running PPO, DPO directly optimizes the policy on preference pairs: given (prompt, chosen, rejected), update the policy to assign higher probability to chosen over rejected.

No reward model. No PPO loop. A frozen reference model is still needed inside the loss, but there is no separate reward-model training stage. Roughly 2× faster to implement and 2× faster to run. Empirically, DPO matches RLHF quality on many benchmarks.

DPO vs RLHF Pipeline

RLHF pipeline: SFT checkpoint → reward model training → PPO loop (policy + reference + reward model) → final policy. Compute cost: 3–4× SFT.

DPO pipeline: SFT checkpoint → DPO training on (preferred, rejected) pairs → final policy. Compute cost: ~1–1.5× SFT.

DPO uses implicit reward modeling — the reward is hidden in the loss function. This simplicity comes with tradeoffs: DPO may be less stable than RLHF on very large models, and reward model evaluation is opaque. But for teams without massive annotation budgets or GPU fleets, DPO is often the pragmatic choice.
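The implicit reward is visible in the loss itself: each response's reward is β times its policy-vs-reference log-probability ratio, and the loss is a sigmoid classification on the difference. A minimal pure-Python sketch of the per-pair loss (function name is ours):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO sigmoid loss: -log sigmoid(beta * (margin_chosen - margin_rejected)).

    Each margin is the log-probability ratio of the policy over the frozen
    reference model, i.e. the implicit per-response reward divided by beta.
    """
    margin_chosen = policy_logp_chosen - ref_logp_chosen
    margin_rejected = policy_logp_rejected - ref_logp_rejected
    logits = beta * (margin_chosen - margin_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy equals the reference, both margins are zero and the loss sits at log 2; pushing probability toward the chosen response drives it down. Higher β penalizes drift from the reference more aggressively, which is the same knob exposed as `beta` in the TRL config below.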

💡 When to use DPO: You have preference data, limited compute, and want to align quickly. Quality is ~95% of RLHF with significantly lower complexity.
Python · Direct Preference Optimization (DPO) training with TRL
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

# Load the policy model; TRL creates the frozen reference copy (for the KL term) itself
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# DPO dataset format: {prompt, chosen, rejected}
# 'chosen' = preferred response, 'rejected' = rejected response
dpo_data = [
    {
        "prompt": "Explain gradient descent.",
        "chosen": "Gradient descent is an optimization algorithm that iteratively moves parameters in the direction that reduces loss...",
        "rejected": "It's just math stuff that makes AI learn."
    },
    # ... more preference pairs
]
dataset = Dataset.from_list(dpo_data)

config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-7,       # lower than SFT — fine-grained preference tuning
    beta=0.1,                 # KL penalty coefficient — higher = stay closer to ref model
    loss_type="sigmoid",      # DPO loss variant
    max_length=512,
    fp16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL automatically creates frozen reference copy
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./dpo-model-final")
05 — Scalable

Constitutional AI

Constitutional AI (CAI) is Anthropic's approach to replacing human preference labels with AI-generated feedback. A set of principles — the "constitution" — guides a capable model to critique and revise its own outputs. The revised outputs become training data for alignment.

This scales without human annotation. Instead of paying annotators to rank responses, you craft a constitution and let the model self-improve. But it requires a capable enough base model to self-critique reliably — weak models will generate poor feedback.

The CAI Process

  1. Constitution: Write explicit principles (e.g., "Be helpful. Be honest. Minimize harm.")
  2. Critique: Ask a strong model to critique its own outputs using the constitution
  3. Revision: The model revises outputs to address critiques
  4. Finetune: Train on revised (better) outputs using SFT
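The four-step recipe above can be sketched as a single critique-and-revise pass. The function name and prompt wording are ours, purely illustrative (not Anthropic's actual templates), and the model call is abstracted behind any callable:

```python
def constitutional_revision(prompt: str, draft: str, constitution: list[str],
                            llm) -> str:
    """One critique-and-revise pass of the CAI recipe.

    `llm` is any callable mapping an instruction string to a completion.
    The revised output is what you would collect as SFT training data.
    """
    principles = "\n".join(f"- {p}" for p in constitution)
    critique = llm(
        f"Principles:\n{principles}\n\n"
        f"Prompt: {prompt}\nResponse: {draft}\n"
        "Critique the response against each principle."
    )
    revised = llm(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to fully address the critique."
    )
    return revised
```

Running this pass over a corpus of prompts and finetuning on the revised outputs is the SFT stage of CAI; several passes can be chained if the critique model keeps finding flaws.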

This trades human effort for LLM compute. The constitution must be well-written and aligned with your values — vague principles lead to vague feedback. And the critique model must be strong enough to notice flaws and suggest improvements.

⚠️ CAI works best for stylistic and behavioral alignment. For factual correctness or domain-specific knowledge, human feedback is still necessary. You can't critique what you don't know.
Python · RLAIF: use LLM to generate preference labels (no human annotators)
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class PreferenceJudgment(BaseModel):
    preferred: str  # "A" or "B"
    reasoning: str
    confidence: float  # 0.0-1.0

JUDGE_SYSTEM = """You are an expert AI alignment judge. Given a question and two responses,
determine which response is: more helpful, more accurate, safer, and better aligned with
human values. Be consistent and objective."""

def generate_preference_label(
    prompt: str, response_a: str, response_b: str
) -> PreferenceJudgment:
    """Constitutional AI / RLAIF: use LLM judge to label preferences."""
    user = f"""Question: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with your preference (A or B), reasoning, and confidence."""

    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user}
        ],
        response_format=PreferenceJudgment,
        temperature=0.0
    )
    return result.choices[0].message.parsed

def build_dpo_dataset(prompts: list[str], model_responses: list[tuple]) -> list[dict]:
    """Build DPO training data using RLAIF labels."""
    dataset = []
    for prompt, (resp_a, resp_b) in zip(prompts, model_responses):
        judgment = generate_preference_label(prompt, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if judgment.preferred == "A" else (resp_b, resp_a)
        if judgment.confidence >= 0.7:  # only use high-confidence labels
            dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
06 — Tradeoffs

Method Comparison

| Method | Human labels needed | Compute cost | Stability | Best when |
| --- | --- | --- | --- | --- |
| SFT | Demonstrations | Low | High | Always — prerequisite |
| RLHF | Preference pairs + RM | High | Medium | Maximum quality, budget available |
| DPO | Preference pairs only | Medium | High | Simpler RLHF alternative |
| Constitutional AI | Minimal | Medium | Medium | Scaling without labellers |

Decision Framework

🎯 Goal: Maximum Quality

  • Use RLHF if budget allows
  • Invest in diverse preference data
  • Run 3–4 iterations

⚡ Goal: Fast Iteration

  • Start with DPO
  • Requires preference data (existing or synthetic)
  • Faster feedback loop

💰 Goal: Minimize Labelling

  • Constitutional AI if model is strong
  • Write clear constitution
  • Use synthetic preferences for DPO

🔬 Goal: Research

  • Start with DPO (simpler, reproducible)
  • Build RLHF as baseline for comparison
  • Ablate reward model components

Cost vs Quality Tradeoff

RLHF achieves the highest quality but at high cost. DPO gets 90–95% of RLHF quality at half the compute. Constitutional AI sacrifices quality for annotation savings. Most teams should start with SFT + DPO, only moving to RLHF if quality plateaus and budget is available.

07 — Practice

Practical Alignment: When to Use Each Method

Choosing between SFT, RLHF, and DPO depends on your resources and goals. SFT alone gets you 70–80% of the way for most applications and requires only a curated dataset. DPO is a significant improvement with modest additional complexity — it only needs preference pairs, no reward model. Full RLHF with PPO is expensive and brittle but can push quality further for high-stakes applications.

A practical guideline: start with SFT on high-quality domain data, evaluate if the model meets your bar, then add DPO with human preference data if not. Only attempt PPO-based RLHF if you have an ML team with reinforcement learning experience and budget for RM training. Constitutional AI / RLAIF is a middle path — use an existing aligned LLM to generate preference labels, bypassing human annotation cost.
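The guideline above can be encoded as a tiny triage function. The name, inputs, and thresholds are ours, purely illustrative:

```python
def choose_alignment_method(sft_meets_bar: bool,
                            has_preference_data: bool,
                            has_rl_team_and_budget: bool) -> str:
    """Encode the practical guideline: SFT first, then DPO; PPO-based RLHF
    only with RL expertise and budget; RLAIF when human labels are unavailable.
    """
    if sft_meets_bar:
        return "SFT only"
    if not has_preference_data:
        return "RLAIF / Constitutional AI to generate labels, then DPO"
    if has_rl_team_and_budget:
        return "PPO-based RLHF"
    return "DPO"

print(choose_alignment_method(False, True, False))  # DPO
```

The point is not the function itself but that the decision reduces to three questions: does SFT already meet the bar, do you have preference labels, and can you afford PPO.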

| Method | Data required | Compute | Quality gain | When to use |
| --- | --- | --- | --- | --- |
| SFT | Curated examples | Low | Good baseline | Always — first step |
| DPO | Preference pairs | Low–Medium | +10–20% | After SFT, before RLHF |
| PPO/RLHF | RM training data | High | +15–30% | High-stakes, large budgets |
| RLAIF / CAI | LLM-generated feedback | Medium | +10–25% | No human annotation budget |
08 — Further Reading

References
