Alignment & Fine-tuning

RLHF in Depth

Reinforcement Learning from Human Feedback: the training approach that transformed language models into helpful, harmless assistants by learning from human preference judgments.


Table of Contents

  1. Why RLHF
  2. The Three Phases
  3. Reward Model Training
  4. PPO Fine-tuning Loop
  5. Practical Challenges
  6. RLHF vs Alternatives
  7. Tools & Implementation
  8. RLHF in Practice: Key Decisions
SECTION 01

Why RLHF

Supervised fine-tuning (SFT)—where you fine-tune a base model on a dataset of (prompt, ideal response) pairs—has a fundamental ceiling. Human annotators can write down ideal responses, but this static dataset cannot capture the infinite diversity of real-world use cases. Moreover, the "ideal response" is inherently subjective: some users prefer conciseness, others want detail; some value accuracy, others creativity.

RLHF solves this by treating human preferences as a training signal. Instead of labeling individual responses as "correct," annotators compare pairs of model outputs and indicate which is better. This preference data is then used to train a reward model that predicts "how much will humans prefer this completion?" Finally, a reinforcement learning algorithm (typically Proximal Policy Optimization, or PPO) tunes the language model's policy to maximize expected human-predicted reward.

The genius of RLHF lies in its leverage: a relatively small preference dataset (tens of thousands of comparisons) can train a reward model, which then guides the LLM's learning across millions of examples. This indirect approach—learning to predict preferences rather than memorizing ideal responses—proved transformative. ChatGPT's release in November 2022 demonstrated RLHF's power: the combination of an instruction-tuned base model plus RLHF produced behavior that felt more natural and helpful than SFT-only models.

The Preference Signal
RLHF doesn't require "ground truth" answers—just comparisons. If two responses are nearly equally good, human raters indicate that. The reward model learns to interpolate and generalize, inferring quality even for cases unseen during annotation.
SECTION 02

The Three Phases of RLHF

Phase 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base model and fine-tune it on a small supervised dataset of high-quality (prompt, response) pairs, often written by expert annotators. This shifts the model's behavior from pure language modeling (predicting the next token) toward instruction-following and helpful responses. SFT is quick—usually a few hours on a single GPU—and establishes a reasonable baseline. Without SFT, the reward model in Phase 2 sees chaotic outputs and struggles to learn meaningful preferences.

Phase 2: Reward Model Training
Collect human preferences: for thousands of prompts, generate multiple candidate responses from the SFT model (or a set of baseline models), and have humans rank or compare them. This creates a dataset of preference pairs (response_A, response_B, preference: A is better). Train a new model (sharing the same architecture as the base model, but with a scalar reward head) to predict: given a prompt and response, what is the reward score? The reward model learns to align with human preferences via supervised learning on preference data.

Phase 3: Reinforcement Learning (PPO)
Use the reward model to provide a learning signal. For each prompt, generate one or more candidate responses from the fine-tuned policy. Compute the reward for each response. Apply PPO (or another RL algorithm) to update the policy to increase expected reward. Crucially, maintain a KL divergence penalty so the policy doesn't drift too far from the SFT model (to prevent degenerate solutions and preserve in-distribution behavior).

RLHF Pipeline Pseudocode:

# Phase 1: SFT
sft_model = finetune(
    base_model,
    dataset=sft_pairs,      # 10k-50k (prompt, ideal_response) pairs
    epochs=2,               # typically 1-3 passes
)

# Phase 2: Reward Model Training
for prompt in preference_dataset:
    # Humans compared model outputs; recorded: A > B
    reward_A = reward_model(prompt, response_A)
    reward_B = reward_model(prompt, response_B)
    # Loss: maximize log(sigmoid(reward_A - reward_B))
    loss += -log(sigmoid(reward_A - reward_B))
reward_model.update()

# Phase 3: PPO Loop
policy = sft_model          # policy starts from the SFT checkpoint
for epoch in range(num_epochs):
    for prompt in prompt_batch:
        # Sample from the current policy
        response = policy.sample(prompt)
        reward = reward_model(prompt, response)
        # Advantage estimate (e.g. via GAE)
        advantage = reward - baseline(prompt)
        # PPO loss with KL penalty
        log_prob = policy.log_prob(response, prompt)
        loss = -advantage * log_prob + kl_weight * KL(policy, ref_model)
        policy.update(loss)
Why Three Phases?
SFT warm-starts the policy in a reasonable region; RM training is parallelizable and economical (small dataset); PPO refinement is efficient because the RM provides dense gradients. Skipping SFT makes RM training much harder.
SECTION 03

Reward Model Training in Detail

The reward model is the linchpin of RLHF. Its job is to assign a scalar score reflecting human preference. Training it well is non-trivial.

Data Collection: Present human raters with a prompt and two model-generated responses (A and B). Raters indicate which is better (or "tie"). Typically, responses are sampled from the SFT model to ensure the RM sees the model's actual error modes. Higher diversity in prompts and response lengths improves RM generalization. Most successful projects collect 100k to 500k preference pairs.

Bradley-Terry Model: The standard approach is the Bradley-Terry model, which treats pairwise comparisons as a logistic model: P(A > B) = sigmoid(r(A) - r(B)), where r(x) is the reward. The cross-entropy loss is L = -log(sigmoid(r(A) - r(B))). This loss encourages the RM to score preferred responses higher and dis-preferred responses lower. It's robust to annotation noise: even if a few labels are flipped, the overall signal is preserved because the model averages over thousands of pairs.

Bradley-Terry Loss and Gradient:

import torch
import torch.nn.functional as F

# responses: (batch_size, 2, seq_len), responses A and B stacked
# preferences: (batch_size,) with 1 if A > B, 0 if B > A
response_A, response_B = responses[:, 0], responses[:, 1]
rewards_A = reward_model(response_A)    # shape: (batch_size, 1)
rewards_B = reward_model(response_B)

# Bradley-Terry loss
preference_logits = (rewards_A - rewards_B).squeeze(-1)
loss = F.binary_cross_entropy_with_logits(
    preference_logits, preferences.float()
)

# Optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Result: preferred responses get high reward, dis-preferred get low reward

RM Validation: Hold out a test set of preferences and measure accuracy: does the RM correctly rank held-out preference pairs? Typical RM accuracy is 70-80%. Accuracy <65% suggests the RM hasn't converged or the annotations are too noisy. Accuracy >90% might indicate the RM has memorized or that the preference signal is trivial.
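The held-out ranking check can be scripted directly. A minimal sketch, assuming a `reward_model` callable that returns a scalar score (a stand-in name, not a real API); the toy RM here scores by length purely for illustration:

```python
def rm_pairwise_accuracy(reward_model, heldout_pairs):
    """Fraction of held-out preference pairs where the RM scores the
    human-preferred response above the rejected one."""
    correct = sum(
        1 for prompt, chosen, rejected in heldout_pairs
        if reward_model(prompt, chosen) > reward_model(prompt, rejected)
    )
    return correct / len(heldout_pairs)

# Toy stand-in RM that just scores by response length
toy_rm = lambda prompt, response: len(response)

pairs = [
    ("q1", "a detailed answer", "ok"),      # longer chosen: ranked correctly
    ("q2", "short", "a very long answer"),  # shorter chosen: misranked
]
print(rm_pairwise_accuracy(toy_rm, pairs))  # 0.5
```

In practice `heldout_pairs` would be tokenized examples scored by the trained RM; the accuracy thresholds from the paragraph above (65%, 90%) apply to this number.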

Calibration: Ensure the RM's reward distribution is reasonable. If all responses score >100 or <-100, the RM has saturated and won't provide useful gradients to PPO. Monitor reward statistics (mean, std) during training and adjust the learning rate or loss scaling if needed.
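The monitoring described above is a few lines of code. A sketch with illustrative thresholds (the saturation cutoff and collapse tolerance are placeholders to tune per project):

```python
import statistics

def check_reward_calibration(rewards, saturation_threshold=100.0):
    """Return (mean, std, warnings) for a batch of RM scores,
    flagging the saturation and collapse failure modes."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    warnings = []
    if abs(mean) > saturation_threshold:
        warnings.append("saturated: rewards far from zero")
    if std < 1e-3:
        warnings.append("collapsed: near-constant rewards")
    return mean, std, warnings

print(check_reward_calibration([0.2, -0.5, 1.1, 0.4])[2])        # [] (healthy)
print(check_reward_calibration([150.0, 151.0, 150.5, 150.2])[2]) # saturated
```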

Common RM Pitfall: Reward Collapse
If the RM always predicts the same reward regardless of response, PPO has no signal. This happens if the preference dataset is too skewed (e.g., 95% of comparisons favor the same response). Ensure diverse preference data and check RM calibration early.
SECTION 04

PPO Fine-tuning Loop

PPO (Proximal Policy Optimization) is a policy gradient algorithm that updates the model to maximize expected reward while staying close to the reference (SFT) model. The PPO loss has three components:

1. Reward Signal: The RM provides a scalar reward r for each generated response. Typically, the reward is clipped (e.g., between -1 and 1) to avoid extreme values that destabilize learning.

2. KL Divergence Penalty: The policy (language model) is encouraged to stay close to the reference model (SFT model) via a KL divergence penalty: KL(policy || reference). This prevents the policy from learning adversarial, degenerate outputs that game the RM. The penalty weight β is tuned (typically 0.01 to 0.1).

3. Advantage Estimation: Compute the advantage A(s) = reward - V(prompt), where V is a value function (baseline) that predicts expected reward given the prompt. This reduces variance in the gradient estimate. The value function is often a head on the policy model, trained to predict mean rewards.
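The `reward - V(prompt)` form above is the simplest advantage estimator; Generalized Advantage Estimation (GAE) is the common refinement. A minimal sketch over one response, with illustrative numbers (in RLHF the reward is usually zero everywhere except the final token):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.
    rewards: per-step rewards; values: per-step value estimates with
    one bootstrap value appended at the end."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Terminal-reward-only case: RM score 1.0 arrives at the last token
adv = gae_advantages([0.0, 0.0, 1.0], [0.2, 0.3, 0.5, 0.0])
print(adv)  # advantages propagate backward from the final reward
```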

PPO Loss with KL Penalty:

# Generate responses under current policy
prompts = batch["prompts"]              # (batch_size,)
responses = policy.sample(prompts)      # (batch_size, seq_len)

# Compute rewards and log probabilities
rewards = reward_model(prompts, responses)
log_probs = policy.log_prob(prompts, responses)

# Baseline (value function)
values = value_model(prompts)

# Advantage estimation, normalized for stability
advantages = rewards - values           # (batch_size,)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Reference model log probs (for the KL penalty)
ref_log_probs = ref_model.log_prob(prompts, responses)

# Policy gradient loss plus KL penalty
# (simplified: full PPO also clips the probability ratio)
policy_loss = -advantages * log_probs
kl_loss = log_probs - ref_log_probs     # per-sample estimate of KL(policy || ref)
total_loss = policy_loss.mean() + beta * kl_loss.mean()

# Value function MSE loss
value_loss = F.mse_loss(values, rewards)

# Total loss
loss = total_loss + value_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

Training Loop Dynamics: In each epoch, collect a batch of prompts, sample responses, compute rewards and advantages, and update both the policy and value function. PPO training is more unstable than SFT: learning rates, KL weight, batch size, and advantage estimation method all matter. Most projects do 1-3 PPO epochs and early-stop based on held-out reward. Training typically lasts 2-4 days on a single large GPU or scales well on multi-GPU.

KL Weight Tuning
Start with β = 0.05 and monitor KL divergence during training. If KL stays <0.5 and reward increases, you're in the good regime. If KL jumps >2.0, increase β. If KL <0.1 and reward plateaus, decrease β to allow more exploration.
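The rule of thumb above can be mechanized as a crude schedule. A sketch that hardcodes the callout's thresholds; the update factor is illustrative, not a tuned controller:

```python
def adjust_kl_weight(beta, observed_kl, kl_high=2.0, kl_low=0.1, factor=1.5):
    """Adjust the KL penalty weight from the observed KL divergence:
    raise beta when the policy drifts too far from the reference,
    lower it when the policy barely moves."""
    if observed_kl > kl_high:
        return beta * factor      # rein the policy back in
    if observed_kl < kl_low:
        return beta / factor      # allow more exploration
    return beta                   # good regime: leave it alone

print(adjust_kl_weight(0.05, 2.5))   # increased
print(adjust_kl_weight(0.05, 0.8))   # unchanged
```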
SECTION 05

Practical Challenges in RLHF

Reward Hacking: The policy can find adversarial outputs that score high on the RM but are actually bad. For example, if the RM is trained to prefer long responses, the policy might generate repetitive text. Mitigation: ensure diversity in the RM training data, use multiple RM heads (ensemble), and include a length penalty in the reward signal. Some labs train RM on "hard negatives"—responses that are technically wrong but score high on surface metrics.
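One simple version of the length penalty mentioned above, as a sketch; the target length and per-token rate are placeholders, not recommended values:

```python
def length_penalized_reward(rm_score, response_tokens, target_len=256,
                            penalty_per_token=0.002):
    """Subtract a linear penalty for tokens beyond a target length,
    a simple guard against the 'longer is always better' reward hack."""
    overflow = max(0, len(response_tokens) - target_len)
    return rm_score - penalty_per_token * overflow

short = ["tok"] * 100
long_resp = ["tok"] * 500
print(length_penalized_reward(1.0, short))      # 1.0, under target: no penalty
print(length_penalized_reward(1.0, long_resp))  # penalized for 244 extra tokens
```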

Mode Collapse: The policy converges to generating the same response for all prompts. This occurs if the RM signal is weak or if the KL penalty is too loose. Mitigation: increase entropy regularization (encourage the policy to sample diverse tokens) or reduce KL weight incrementally. Monitor mode collapse by tracking unique tokens and response diversity.
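A common way to track the response diversity mentioned above is distinct-n (unique n-grams over total n-grams). A minimal sketch with toy strings:

```python
def distinct_n(responses, n=2):
    """Distinct-n: unique n-grams / total n-grams across a batch of
    responses. A collapsing policy drives this toward zero."""
    total, unique = 0, set()
    for resp in responses:
        toks = resp.split()
        for i in range(len(toks) - n + 1):
            unique.add(tuple(toks[i:i + n]))
            total += 1
    return len(unique) / max(1, total)

diverse = ["the cat sat", "a dog ran fast", "birds fly south"]
collapsed = ["i cannot help", "i cannot help", "i cannot help"]
print(distinct_n(diverse))    # 1.0: every bigram unique
print(distinct_n(collapsed))  # ~0.33: the same bigrams repeated
```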

Compute Cost: RLHF is expensive. Generating responses (sampling from the policy), scoring them (forward pass through RM), and running PPO require significant compute. Scaling to large models (100B+ parameters) becomes prohibitive. Mitigation: use inference optimization (quantization, KV caching), batch samples across multiple GPUs, or use distributed PPO algorithms.

Annotation Quality: Human annotators have biases, disagreements, and fatigue. Inter-rater agreement is typically 60-70%; some pairs are ambiguous. Low-quality annotations corrupt the RM. Mitigation: use multiple raters per pair and aggregate via voting, train raters on rubrics, and monitor annotation consistency over time. Consider using AI judges (e.g., a strong LLM) to augment human feedback.
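Aggregating multiple raters per pair can be as simple as a majority vote plus an agreement score, which also lets you drop or re-annotate low-agreement pairs. A sketch with illustrative label names:

```python
from collections import Counter

def aggregate_votes(votes):
    """Majority vote over per-pair rater labels ('A', 'B', or 'tie').
    Returns the winning label and its agreement rate."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)

label, agreement = aggregate_votes(["A", "A", "B"])
print(label)       # A
print(agreement)   # 2/3 of raters agreed
```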

Reward Model Overfitting
The RM can overfit to the SFT model's outputs, learning spurious features instead of genuine preference signals. If you then use PPO to push the policy away from the SFT model, the RM's predictions become unreliable. Mitigation: validate the RM on diverse, high-quality held-out preference data; sample responses from a diverse set of models when collecting preferences (Phase 2).
SECTION 06

RLHF vs Alternatives

RLHF is powerful but not the only path. Alternative alignment approaches have emerged, each with tradeoffs:

Method | Key Idea | Pros | Cons
RLHF | Train an RM on preferences; optimize with PPO | Proven at scale; leverages human judgment; indirect learning | Expensive; complex pipeline; RM reward hacking
DPO | Optimize the policy directly on preference pairs, no RM | Simpler; fewer hyperparameters; faster; stable | Needs a larger preference dataset; less explored at very large scale
RLAIF | Use AI judges (a strong LLM) instead of humans for preferences | Scales without human annotation; cheaper; faster iteration | Risk of circular reasoning; AI-judge biases propagate
Constitutional AI | Fine-tune on critiques from a principle-based LLM judge | Interpretable; aligns to explicit values; low annotation cost | Critique generation can be weak; depends on constitution phrasing

In practice, many labs use a hybrid: start with RLHF to achieve good performance, then switch to DPO (cheaper, faster) for iterative refinement. RLAIF and Constitutional AI are exciting for scaling—they reduce human annotation burden—but they introduce new risks (e.g., if the AI judge is biased, the bias propagates).
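For reference, the DPO objective from the table fits in a few lines. This sketch takes summed log-probabilities per response as plain floats (illustrative values, not from a real model):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of
    the chosen/rejected responses under the policy and the frozen
    reference: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((pi_chosen - pi_rejected) - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy widens the chosen-vs-rejected margin relative to the reference
loss_good = dpo_loss(-10.0, -14.0, -11.0, -12.0)
# Policy prefers the rejected response instead
loss_bad = dpo_loss(-14.0, -10.0, -11.0, -12.0)
print(loss_good < loss_bad)  # True
```

No reward model and no sampling loop appear anywhere, which is exactly why DPO is cheaper for iterative refinement.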

Modern Trends
The field is moving toward simpler, more stable alternatives to RLHF. DPO (2023) demonstrated that you don't need an explicit RM. Recent work on online DPO and iterative preference optimization suggests even simpler approaches. However, RLHF remains the proven path to state-of-the-art quality at scale.
SECTION 07

Tools & Implementation

The Hugging Face TRL (Transformers Reinforcement Learning) library is the go-to toolkit for RLHF and variants. It provides SFTTrainer, RewardTrainer, and PPOTrainer that abstract away complexity.

SFT with TRL:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model_name = "meta-llama/Llama-2-7b"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load SFT dataset: list of {"text": "prompt\nresponse"}
sft_dataset = load_dataset("json", data_files="sft_data.jsonl")

trainer = SFTTrainer(
    model,
    train_dataset=sft_dataset["train"],
    args=SFTConfig(
        output_dir="./sft_output",
        learning_rate=2e-4,
        max_steps=10000,
        per_device_train_batch_size=4,
    ),
    tokenizer=tokenizer,
)
trainer.train()
Reward Model Training with TRL:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

# Dataset: list of {
#   "input_ids_chosen": [...], "attention_mask_chosen": [...],
#   "input_ids_rejected": [...], "attention_mask_rejected": [...]
# }
rm_dataset = load_dataset("json", data_files="preferences.jsonl")

# Reward model: the SFT model body with a scalar reward head
# (a sequence-classification head with a single output)
rm_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/sft_model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("path/to/sft_model")

trainer = RewardTrainer(
    model=rm_model,
    train_dataset=rm_dataset["train"],
    args=RewardConfig(
        output_dir="./rm_output",
        learning_rate=1e-4,
        max_steps=5000,
        per_device_train_batch_size=8,
    ),
    tokenizer=tokenizer,
)
trainer.train()
PPO Fine-tuning with TRL:

from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    learning_rate=1e-5,
    output_dir="./ppo_output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    ppo_epochs=3,
    num_rollouts=4,            # samples per prompt
)

ppo_trainer = PPOTrainer(
    model=sft_model,
    ref_model=sft_model,       # in practice, a frozen copy of the SFT model (KL reference)
    reward_model=rm_model,
    tokenizer=tokenizer,
    train_dataset=prompt_dataset,   # just prompts
    args=ppo_config,
)

# Main loop
for epoch in range(num_epochs):
    for batch in prompt_dataloader:
        prompts = batch["input_ids"]
        # Generate responses
        responses = ppo_trainer.generate(prompts)
        # Compute rewards
        rewards = reward_model(responses)
        # PPO update
        stats = ppo_trainer.step(prompts, responses, rewards)

# Save final model
ppo_trainer.save_model("./final_model")

Hyperparameter Tuning: Key hyperparameters include SFT learning rate (1e-4 to 5e-4), RM learning rate (1e-5 to 1e-4), PPO learning rate (1e-5 to 1e-6), KL weight (0.01 to 0.1), and PPO epochs (1-4). Start conservative and scale up. Use a held-out validation set to pick the final checkpoint.

Hardware & Cost: Training a 7B model end-to-end (SFT + RM + PPO) on a single A100 GPU typically takes 2-4 weeks. For larger models (70B+), use multi-GPU or multi-node distributed training. Cloud providers (Lambda Labs, Modal, RunPod) offer GPUs by the hour; budget $1k-10k per full RLHF run depending on model size and dataset.

Open-Source Models
Llama-2 Chat (fine-tuned with RLHF), Mistral Instruct, and Zephyr are all open models aligned with RLHF or closely related preference-tuning methods (Zephyr, for example, used DPO). Their checkpoints and training recipes serve as baselines for further tuning. Many labs openly share RLHF code and lessons learned.
SECTION 08

RLHF in Practice: Key Decisions

Successful RLHF deployments share a common set of careful choices that rarely appear in papers. Annotator selection matters enormously: annotators with domain expertise (e.g. medical professionals for medical RLHF) produce reward models that generalize better than crowdworkers, whose preferences are noisier. Budget 20–30% of total annotation cost for annotator calibration: showing agreed examples and measuring inter-annotator agreement before starting the main labeling run.
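The inter-annotator agreement gate can be computed as mean pairwise agreement; chance-corrected measures like Cohen's or Fleiss' kappa are stricter alternatives. A sketch with toy labels:

```python
from itertools import combinations

def pairwise_agreement(labels_by_rater):
    """Mean observed agreement across all rater pairs, item by item.
    labels_by_rater: one list of labels per rater, aligned by item."""
    agreements = []
    for a, b in combinations(labels_by_rater, 2):
        matches = sum(1 for x, y in zip(a, b) if x == y)
        agreements.append(matches / len(a))
    return sum(agreements) / len(agreements)

raters = [
    ["A", "B", "A", "tie"],   # rater 1
    ["A", "B", "B", "tie"],   # rater 2
    ["A", "A", "A", "tie"],   # rater 3
]
print(pairwise_agreement(raters))  # mean agreement over the 3 rater pairs
```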

Reward model architecture should mirror the policy model in scale. A reward model two size classes smaller than the policy will miss subtle quality differences that the policy then exploits. Train the reward model with a held-out "golden set" of 200–500 examples judged by senior annotators; use this set as a calibration probe at the end of each RM training run to detect overfitting before it propagates to PPO.

For PPO, clip the KL penalty coefficient (beta) between 0.01 and 0.1. Lower beta allows more deviation from the SFT policy, which can improve helpfulness but risks reward hacking. Monitor the KL divergence metric directly during training; if it exceeds 20 nats, the policy is drifting too far and should be pulled back. A simple proxy for reward hacking: if the reward model score is rising but human preference ratings on a live sample are flat or declining, the policy has found a shortcut the reward model rewards but humans dislike.
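The reward-hacking proxy in the last sentence reduces to comparing trends: RM score rising while live human ratings are flat or declining. A sketch using least-squares slopes over toy metric series:

```python
def slope(series):
    """Least-squares slope of a metric series against its step index."""
    n = len(series)
    mean_i = (n - 1) / 2
    mean_x = sum(series) / n
    num = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(series))
    den = sum((i - mean_i) ** 2 for i in range(n))
    return num / den

def reward_hacking_suspected(rm_scores, human_ratings):
    """True when the RM score trends up while human ratings do not."""
    return slope(rm_scores) > 0 and slope(human_ratings) <= 0

# RM score climbing, human ratings slipping: suspicious
print(reward_hacking_suspected([0.1, 0.3, 0.5], [0.7, 0.7, 0.65]))  # True
# Both climbing together: healthy
print(reward_hacking_suspected([0.1, 0.3, 0.5], [0.6, 0.65, 0.7]))  # False
```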