Reinforcement Learning from Human Feedback: the training approach that transformed language models into helpful, harmless assistants by learning from human preference judgments.
Supervised fine-tuning (SFT)—where you fine-tune a base model on a dataset of (prompt, ideal response) pairs—has a fundamental ceiling. Human annotators can write down ideal responses, but this static dataset cannot capture the infinite diversity of real-world use cases. Moreover, the "ideal response" is inherently subjective: some users prefer conciseness, others want detail; some value accuracy, others creativity.
RLHF solves this by treating human preferences as a training signal. Instead of labeling individual responses as "correct," annotators compare pairs of model outputs and indicate which is better. This preference data is then used to train a reward model that predicts "how much will humans prefer this completion?" Finally, a reinforcement learning algorithm (typically Proximal Policy Optimization, or PPO) tunes the language model's policy to maximize expected human-predicted reward.
The genius of RLHF lies in its leverage: a relatively small preference dataset (tens of thousands of comparisons) can train a reward model, which then guides the LLM's learning across millions of examples. This indirect approach—learning to predict preferences rather than memorizing ideal responses—proved transformative. ChatGPT's release in November 2022 demonstrated RLHF's power: the combination of an instruction-tuned base model plus RLHF produced behavior that felt more natural and helpful than SFT-only models.
Phase 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base model and fine-tune it on a small supervised dataset of high-quality (prompt, response) pairs, often written by expert annotators. This shifts the model's behavior from pure language modeling (predicting the next token) toward instruction-following and helpful responses. SFT is quick relative to the later phases—often a few hours on a single GPU for smaller models—and establishes a reasonable baseline. Without SFT, the reward model in Phase 2 sees chaotic outputs and struggles to learn meaningful preferences.
Phase 2: Reward Model Training
Collect human preferences: for thousands of prompts, generate multiple candidate responses from the SFT model (or a set of baseline models), and have humans rank or compare them. This creates a dataset of preference pairs (response_A, response_B, preference: A is better). Train a new model (sharing the same architecture as the base model, but with a scalar reward head) to predict: given a prompt and response, what is the reward score? The reward model learns to align with human preferences via supervised learning on preference data.
Phase 3: Reinforcement Learning (PPO)
Use the reward model to provide a learning signal. For each prompt, generate one or more candidate responses from the fine-tuned policy. Compute the reward for each response. Apply PPO (or another RL algorithm) to update the policy to increase expected reward. Crucially, maintain a KL divergence penalty so the policy doesn't drift too far from the SFT model (to prevent degenerate solutions and preserve in-distribution behavior).
The reward model is the linchpin of RLHF. Its job is to assign a scalar score reflecting human preference. Training it well is non-trivial.
Data Collection: Present human raters with a prompt and two model-generated responses (A and B). Raters indicate which is better (or "tie"). Typically, responses are sampled from the SFT model so the RM sees the model's actual error modes. Higher diversity in prompts and response lengths improves RM generalization. Successful projects have collected anywhere from tens of thousands to several hundred thousand preference pairs.
Bradley-Terry Model: The standard approach is the Bradley-Terry model, which treats pairwise comparisons as logistic: P(A > B) = sigmoid(r(A) - r(B)), where r(x) is the reward. The training loss is the cross-entropy L = -log(sigmoid(r(A) - r(B))), which pushes the RM to score preferred responses higher than dis-preferred ones. It is reasonably robust to annotation noise: even if a few labels are flipped, the aggregate signal is preserved because the model learns from thousands of pairs.
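The Bradley-Terry loss is simple enough to sketch in a few lines of plain Python; the reward scores passed in below are illustrative values, not real RM outputs:

```python
import math

def bt_loss(r_a: float, r_b: float) -> float:
    """Bradley-Terry cross-entropy loss for one comparison where A is preferred.

    P(A > B) = sigmoid(r_a - r_b); loss = -log P(A > B).
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_a - r_b))))

# RM already scores the preferred response higher: small loss.
print(round(bt_loss(2.0, 0.0), 4))  # 0.1269
# Scores flipped relative to the human label: large loss.
print(round(bt_loss(0.0, 2.0), 4))  # 2.1269
```

Note the loss never reaches zero: even a correctly ordered pair with a tied margin (`bt_loss(1.0, 1.0)` = -log 0.5 ≈ 0.693) keeps pushing the margin apart, which is what makes the signal survive a few flipped labels.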
RM Validation: Hold out a test set of preferences and measure accuracy: does the RM correctly rank held-out preference pairs? Typical RM accuracy is 70-80%. Accuracy <65% suggests the RM hasn't converged or the annotations are too noisy. Accuracy >90% might indicate the RM has memorized or that the preference signal is trivial.
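The held-out accuracy check reduces to counting how often the RM scores the human-preferred response higher. A minimal sketch, assuming the RM has already scored each pair (the score tuples are made up):

```python
def rm_accuracy(pairs):
    """Fraction of held-out preference pairs the RM ranks correctly.

    `pairs` is a list of (reward_preferred, reward_rejected) tuples:
    reward-model scores on a held-out test set, with the human-preferred
    response's score first.
    """
    correct = sum(1 for r_pref, r_rej in pairs if r_pref > r_rej)
    return correct / len(pairs)

# Hypothetical held-out scores: the third pair is ranked wrong.
held_out = [(1.2, 0.3), (0.8, -0.1), (0.1, 0.4), (2.0, 1.1)]
print(rm_accuracy(held_out))  # 0.75
```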
Calibration: Ensure the RM's reward distribution is reasonable. If all responses score >100 or <-100, the RM has saturated and won't provide useful gradients to PPO. Monitor reward statistics (mean, std) during training and adjust the learning rate or loss scaling if needed.
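One common adjustment (a widespread practice, not the only option the text's "adjust the learning rate or loss scaling" allows) is to standardize rewards on the fly so PPO always sees a mean-zero, unit-variance signal. This sketch also flags a collapsed distribution, which is how saturation shows up in the statistics:

```python
import statistics

def standardize_rewards(rewards, eps=1e-8):
    """Z-score raw RM outputs so PPO sees rewards with mean 0 and std 1.

    A saturated RM (all scores pinned at an extreme) shows up as a
    near-zero standard deviation, which is worth failing loudly on
    before it starves PPO of gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        raise ValueError("reward distribution has collapsed; check the RM")
    return [(r - mean) / std for r in rewards]

# Raw scores all above 100, but the *relative* ordering is preserved.
print(standardize_rewards([120.0, 130.0, 110.0]))
```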
PPO (Proximal Policy Optimization) is a policy gradient algorithm that updates the model to maximize expected reward while staying close to the reference (SFT) model. The PPO setup combines three ingredients:
1. Reward Signal: The RM provides a scalar reward r for each generated response. Typically, the reward is clipped (e.g., between -1 and 1) to avoid extreme values that destabilize learning.
2. KL Divergence Penalty: The policy (language model) is encouraged to stay close to the reference model (SFT model) via a KL divergence penalty: KL(policy || reference). This prevents the policy from learning adversarial, degenerate outputs that game the RM. The penalty weight β is tuned (typically 0.01 to 0.1).
3. Advantage Estimation: Compute the advantage A(s) = reward - V(prompt), where V is a value function (baseline) that predicts expected reward given the prompt. This reduces variance in the gradient estimate. The value function is often a head on the policy model, trained to predict mean rewards.
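Putting the three ingredients together, the per-response quantity PPO actually optimizes might look like this sketch. The beta value, clip range, and all numbers are illustrative, and the KL term uses the common single-sample estimate from sequence log-probabilities:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.05, clip=1.0):
    """Per-response reward as PPO sees it: clipped RM score minus a
    KL penalty estimated from the sequence log-probs.

    log pi(x) - log ref(x) for x sampled from the policy is a standard
    single-sample estimate of KL(policy || reference).
    """
    r = max(-clip, min(clip, rm_score))   # clip extreme RM scores
    kl_est = logp_policy - logp_ref       # sampled KL estimate
    return r - beta * kl_est

def advantage(reward, value_estimate):
    """A = reward - V(prompt): subtract the baseline to reduce variance."""
    return reward - value_estimate

# RM score 2.3 clips to 1.0; KL estimate is -41.0 - (-44.5) = 3.5 nats.
r = shaped_reward(rm_score=2.3, logp_policy=-41.0, logp_ref=-44.5)
print(round(advantage(r, value_estimate=0.4), 3))  # 0.425
```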
Training Loop Dynamics: In each epoch, collect a batch of prompts, sample responses, compute rewards and advantages, and update both the policy and value function. PPO training is less stable than SFT: learning rates, KL weight, batch size, and the advantage estimation method all matter. Most projects run 1-3 PPO epochs and early-stop based on held-out reward. Training typically takes 2-4 days on a single large GPU and parallelizes well across multiple GPUs.
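The "stay close to the policy that generated the samples" behavior inside each PPO update comes from PPO's clipped surrogate objective, sketched here for a single sampled response. The eps=0.2 default is conventional; the log-probabilities and advantage are made-up numbers:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate for one sampled response.

    ratio = pi_new(x) / pi_old(x); taking the min with the clipped ratio
    removes the incentive to push the policy further than (1 +/- eps)
    away from the sampling policy in a single update.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# ratio = e^0.5 ~= 1.65, clipped to 1.2, so the objective is capped
# at 1.2 * advantage instead of 1.65 * advantage.
print(round(ppo_clipped_objective(-40.0, -40.5, advantage=0.5), 4))  # 0.6
```

This objective is maximized (in practice, its negative is minimized), which is why the clipping caps the *gain* from large ratios rather than the loss.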
Reward Hacking: The policy can find adversarial outputs that score high on the RM but are actually bad. For example, if the RM is trained to prefer long responses, the policy might generate repetitive text. Mitigation: ensure diversity in the RM training data, use multiple RM heads (ensemble), and include a length penalty in the reward signal. Some labs train RM on "hard negatives"—responses that are technically wrong but score high on surface metrics.
Mode Collapse: The policy converges to generating the same response for all prompts. This occurs if the RM signal is weak or if the KL penalty is too loose. Mitigation: increase entropy regularization (encourage the policy to sample diverse tokens) or reduce KL weight incrementally. Monitor mode collapse by tracking unique tokens and response diversity.
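A cheap diversity monitor is the fraction of unique tokens across a batch of responses. Whitespace splitting here stands in for the model's real tokenizer, and the sample batches are fabricated to show the contrast:

```python
def distinct_token_ratio(responses):
    """Unique-token fraction over a batch of responses.

    A collapsing policy drives this ratio toward 1/batch_size as
    outputs become near-identical; track it alongside reward curves.
    """
    tokens = [tok for resp in responses for tok in resp.split()]
    return len(set(tokens)) / max(1, len(tokens))

healthy = ["the cat sat", "dogs run fast", "rain falls today"]
collapsed = ["i cannot help", "i cannot help", "i cannot help"]
print(distinct_token_ratio(healthy))            # 1.0 -- every token unique
print(round(distinct_token_ratio(collapsed), 2))  # 0.33
```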
Compute Cost: RLHF is expensive. Generating responses (sampling from the policy), scoring them (forward pass through RM), and running PPO require significant compute. Scaling to large models (100B+ parameters) becomes prohibitive. Mitigation: use inference optimization (quantization, KV caching), batch samples across multiple GPUs, or use distributed PPO algorithms.
Annotation Quality: Human annotators have biases, disagreements, and fatigue. Inter-rater agreement is typically 60-70%; some pairs are ambiguous. Low-quality annotations corrupt the RM. Mitigation: use multiple raters per pair and aggregate via voting, train raters on rubrics, and monitor annotation consistency over time. Consider using AI judges (e.g., a strong LLM) to augment human feedback.
RLHF is powerful but not the only path. Alternative alignment approaches have emerged, each with tradeoffs:
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| RLHF | Train RM on preferences; use PPO | Proven at scale; leverages human judgment; indirect learning | Expensive; complex pipeline; RM reward hacking |
| DPO | Optimize policy directly on preference pairs, no RM | Simpler; fewer hyperparameters; faster; stable | Needs larger preference dataset; less explored at very large scale |
| RLAIF | Use AI judges (strong LLM) instead of humans for preference | Scales without human annotation; cheaper; faster iteration | Risk of circular reasoning; AI judge biases propagate |
| Constitutional AI | Finetune on critiques from principle-based LLM judge | Interpretable; aligns to explicit values; low annotation cost | Critique generation can be weak; depends on constitution phrasing |
In practice, many labs use a hybrid: start with RLHF to achieve good performance, then switch to DPO (cheaper, faster) for iterative refinement. RLAIF and Constitutional AI are exciting for scaling—they reduce human annotation burden—but they introduce new risks (e.g., if the AI judge is biased, the bias propagates).
The Hugging Face TRL (Transformers Reinforcement Learning) library is the go-to toolkit for RLHF and variants. It provides SFTTrainer, RewardTrainer, and PPOTrainer that abstract away complexity.
Hyperparameter Tuning: Key hyperparameters include SFT learning rate (1e-4 to 5e-4), RM learning rate (1e-5 to 1e-4), PPO learning rate (1e-6 to 1e-5), KL weight (0.01 to 0.1), and PPO epochs (1-4). Start conservative and scale up. Use a held-out validation set to pick the final checkpoint.
Hardware & Cost: Training a 7B model end-to-end (SFT + RM + PPO) on a single A100 GPU typically takes 2-4 weeks. For larger models (70B+), use multi-GPU or multi-node distributed training. Cloud providers (Lambda Labs, Modal, RunPod) offer GPUs by the hour; budget $1k-10k per full RLHF run depending on model size and dataset.
Successful RLHF deployments share a common set of careful choices that rarely appear in papers. Annotator selection matters enormously: annotators with domain expertise (e.g., medical professionals for medical RLHF) produce reward models that generalize better than crowdworkers, whose preferences are noisier. Budget 20-30% of total annotation cost on annotator calibration: showing agreed-upon examples and measuring inter-annotator agreement before starting the main labeling run.
Reward model architecture should mirror the policy model in scale. A reward model that is two size classes smaller than the policy will underfit to subtle quality differences the policy exploits. Train the reward model with a held-out "golden set" of 200–500 examples judged by senior annotators; use this set as a calibration probe at the end of each RM training run to detect over-fitting before it propagates to PPO.
For PPO, keep the KL penalty coefficient (beta) between 0.01 and 0.1. Lower beta allows more deviation from the SFT policy, which can improve helpfulness but risks reward hacking. Monitor the KL divergence metric directly during training; if it exceeds 20 nats, the policy is drifting too far and should be pulled back. A simple proxy for reward hacking: if the reward model score is rising but human preference ratings on a live sample are flat or declining, the policy has found a shortcut the reward model rewards but humans dislike.
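Both monitors in this paragraph can be sketched in a few lines, assuming you log per-sequence log-probabilities under the policy and the reference model. The threshold, log-probs, and trend values below are illustrative:

```python
def mean_kl_estimate(logps_policy, logps_ref):
    """Batch-averaged sampled-KL estimate in nats.

    For sequences sampled from the policy, log pi(x) - log ref(x)
    averaged over the batch estimates KL(policy || reference); values
    creeping past a chosen budget (the text suggests ~20 nats) flag
    the policy drifting too far from the SFT model.
    """
    diffs = [p - r for p, r in zip(logps_policy, logps_ref)]
    return sum(diffs) / len(diffs)

def hacking_suspected(rm_score_trend, human_rating_trend):
    """Proxy from the text: RM score rising while live human ratings
    are flat or declining suggests the policy found a shortcut."""
    return rm_score_trend > 0 and human_rating_trend <= 0

# Two monitored sequences, well within a 20-nat budget.
print(mean_kl_estimate([-40.0, -38.0], [-44.0, -41.0]))  # 3.5
# RM scores rising (+0.6) while human ratings dip (-0.05): investigate.
print(hacking_suspected(0.6, -0.05))  # True
```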