Reinforcement Learning from Human Feedback: the training approach that transformed language models into helpful, harmless assistants by learning from human preference judgments.
Supervised fine-tuning (SFT)—where you fine-tune a base model on a dataset of (prompt, ideal response) pairs—has a fundamental ceiling. Human annotators can write down ideal responses, but this static dataset cannot capture the infinite diversity of real-world use cases. Moreover, the "ideal response" is inherently subjective: some users prefer conciseness, others want detail; some value accuracy, others creativity.
RLHF solves this by treating human preferences as a training signal. Instead of labeling individual responses as "correct," annotators compare pairs of model outputs and indicate which is better. This preference data is then used to train a reward model that predicts "how much will humans prefer this completion?" Finally, a reinforcement learning algorithm (typically Proximal Policy Optimization, or PPO) tunes the language model's policy to maximize expected human-predicted reward.
The genius of RLHF lies in its leverage: a relatively small preference dataset (tens of thousands of comparisons) can train a reward model, which then guides the LLM's learning across millions of examples. This indirect approach—learning to predict preferences rather than memorizing ideal responses—proved transformative. ChatGPT's release in November 2022 demonstrated RLHF's power: the combination of an instruction-tuned base model plus RLHF produced behavior that felt more natural and helpful than SFT-only models.
Phase 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base model and fine-tune it on a small supervised dataset of high-quality (prompt, response) pairs, often written by expert annotators. This shifts the model's behavior from pure language modeling (predicting the next token) toward instruction-following and helpful responses. SFT is quick relative to the later phases—often a few hours on a single GPU for smaller models—and establishes a reasonable baseline. Without SFT, the reward model in Phase 2 sees chaotic outputs and struggles to learn meaningful preferences.
Phase 2: Reward Model Training
Collect human preferences: for thousands of prompts, generate multiple candidate responses from the SFT model (or a set of baseline models), and have humans rank or compare them. This creates a dataset of preference pairs (response_A, response_B, preference: A is better). Train a new model (sharing the same architecture as the base model, but with a scalar reward head) to predict: given a prompt and response, what is the reward score? The reward model learns to align with human preferences via supervised learning on preference data.
Phase 3: Reinforcement Learning (PPO)
Use the reward model to provide a learning signal. For each prompt, generate one or more candidate responses from the fine-tuned policy. Compute the reward for each response. Apply PPO (or another RL algorithm) to update the policy to increase expected reward. Crucially, maintain a KL divergence penalty so the policy doesn't drift too far from the SFT model (to prevent degenerate solutions and preserve in-distribution behavior).
The reward model is the linchpin of RLHF. Its job is to assign a scalar score reflecting human preference. Training it well is non-trivial.
Data Collection: Present human raters with a prompt and two model-generated responses (A and B). Raters indicate which is better (or "tie"). Typically, responses are sampled from the SFT model so the RM sees the model's actual error modes. Higher diversity in prompts and response lengths improves RM generalization. Successful projects have collected anywhere from tens of thousands to several hundred thousand preference pairs.
Bradley-Terry Model: The standard approach is the Bradley-Terry model, which treats pairwise comparisons as logistic: P(A > B) = sigmoid(r(A) - r(B)), where r(x) is the reward. The training loss is the cross-entropy L = -log(sigmoid(r(A) - r(B))), which pushes the RM to score preferred responses higher than dis-preferred ones. It is reasonably robust to annotation noise: even if a few labels are flipped, the aggregate signal is preserved because the model learns from thousands of pairs.
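The Bradley-Terry loss is simple enough to sketch in a few lines of plain Python; the reward scores passed in below are illustrative values, not real RM outputs:

```python
import math

def bt_loss(r_a: float, r_b: float) -> float:
    """Bradley-Terry cross-entropy loss for one comparison where A is preferred.

    P(A > B) = sigmoid(r_a - r_b); loss = -log P(A > B).
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_a - r_b))))

# RM already scores the preferred response higher: small loss.
print(round(bt_loss(2.0, 0.0), 4))  # 0.1269
# Scores flipped relative to the human label: large loss.
print(round(bt_loss(0.0, 2.0), 4))  # 2.1269
```

Note the loss never reaches zero: even a correctly ordered pair with a tied margin (`bt_loss(1.0, 1.0)` = -log 0.5 ≈ 0.693) keeps pushing the margin apart, which is what makes the signal survive a few flipped labels.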
RM Validation: Hold out a test set of preferences and measure accuracy: does the RM correctly rank held-out preference pairs? Typical RM accuracy is 70-80%. Accuracy <65% suggests the RM hasn't converged or the annotations are too noisy. Accuracy >90% might indicate the RM has memorized or that the preference signal is trivial.
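The held-out accuracy check reduces to counting how often the RM scores the human-preferred response higher. A minimal sketch, assuming the RM has already scored each pair (the score tuples are made up):

```python
def rm_accuracy(pairs):
    """Fraction of held-out preference pairs the RM ranks correctly.

    `pairs` is a list of (reward_preferred, reward_rejected) tuples:
    reward-model scores on a held-out test set, with the human-preferred
    response's score first.
    """
    correct = sum(1 for r_pref, r_rej in pairs if r_pref > r_rej)
    return correct / len(pairs)

# Hypothetical held-out scores: the third pair is ranked wrong.
held_out = [(1.2, 0.3), (0.8, -0.1), (0.1, 0.4), (2.0, 1.1)]
print(rm_accuracy(held_out))  # 0.75
```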
Calibration: Ensure the RM's reward distribution is reasonable. If all responses score >100 or <-100, the RM has saturated and won't provide useful gradients to PPO. Monitor reward statistics (mean, std) during training and adjust the learning rate or loss scaling if needed.
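One common adjustment (a widespread practice, not the only option the text's "adjust the learning rate or loss scaling" allows) is to standardize rewards on the fly so PPO always sees a mean-zero, unit-variance signal. This sketch also flags a collapsed distribution, which is how saturation shows up in the statistics:

```python
import statistics

def standardize_rewards(rewards, eps=1e-8):
    """Z-score raw RM outputs so PPO sees rewards with mean 0 and std 1.

    A saturated RM (all scores pinned at an extreme) shows up as a
    near-zero standard deviation, which is worth failing loudly on
    before it starves PPO of gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        raise ValueError("reward distribution has collapsed; check the RM")
    return [(r - mean) / std for r in rewards]

# Raw scores all above 100, but the *relative* ordering is preserved.
print(standardize_rewards([120.0, 130.0, 110.0]))
```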
PPO (Proximal Policy Optimization) is a policy gradient algorithm that updates the model to maximize expected reward while staying close to the reference (SFT) model. The PPO setup combines three ingredients:
1. Reward Signal: The RM provides a scalar reward r for each generated response. Typically, the reward is clipped (e.g., between -1 and 1) to avoid extreme values that destabilize learning.
2. KL Divergence Penalty: The policy (language model) is encouraged to stay close to the reference model (SFT model) via a KL divergence penalty: KL(policy || reference). This prevents the policy from learning adversarial, degenerate outputs that game the RM. The penalty weight β is tuned (typically 0.01 to 0.1).
3. Advantage Estimation: Compute the advantage A(s) = reward - V(prompt), where V is a value function (baseline) that predicts expected reward given the prompt. This reduces variance in the gradient estimate. The value function is often a head on the policy model, trained to predict mean rewards.
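Putting the three ingredients together, the per-response quantity PPO actually optimizes might look like this sketch. The beta value, clip range, and all numbers are illustrative, and the KL term uses the common single-sample estimate from sequence log-probabilities:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.05, clip=1.0):
    """Per-response reward as PPO sees it: clipped RM score minus a
    KL penalty estimated from the sequence log-probs.

    log pi(x) - log ref(x) for x sampled from the policy is a standard
    single-sample estimate of KL(policy || reference).
    """
    r = max(-clip, min(clip, rm_score))   # clip extreme RM scores
    kl_est = logp_policy - logp_ref       # sampled KL estimate
    return r - beta * kl_est

def advantage(reward, value_estimate):
    """A = reward - V(prompt): subtract the baseline to reduce variance."""
    return reward - value_estimate

# RM score 2.3 clips to 1.0; KL estimate is -41.0 - (-44.5) = 3.5 nats.
r = shaped_reward(rm_score=2.3, logp_policy=-41.0, logp_ref=-44.5)
print(round(advantage(r, value_estimate=0.4), 3))  # 0.425
```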
Training Loop Dynamics: In each epoch, collect a batch of prompts, sample responses, compute rewards and advantages, and update both the policy and value function. PPO training is less stable than SFT: learning rates, KL weight, batch size, and the advantage estimation method all matter. Most projects run 1-3 PPO epochs and early-stop based on held-out reward. Training typically takes 2-4 days on a single large GPU and parallelizes well across multiple GPUs.
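The "stay close to the policy that generated the samples" behavior inside each PPO update comes from PPO's clipped surrogate objective, sketched here for a single sampled response. The eps=0.2 default is conventional; the log-probabilities and advantage are made-up numbers:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate for one sampled response.

    ratio = pi_new(x) / pi_old(x); taking the min with the clipped ratio
    removes the incentive to push the policy further than (1 +/- eps)
    away from the sampling policy in a single update.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# ratio = e^0.5 ~= 1.65, clipped to 1.2, so the objective is capped
# at 1.2 * advantage instead of 1.65 * advantage.
print(round(ppo_clipped_objective(-40.0, -40.5, advantage=0.5), 4))  # 0.6
```

This objective is maximized (in practice, its negative is minimized), which is why the clipping caps the *gain* from large ratios rather than the loss.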
Reward Hacking: The policy can find adversarial outputs that score high on the RM but are actually bad. For example, if the RM is trained to prefer long responses, the policy might generate repetitive text. Mitigation: ensure diversity in the RM training data, use multiple RM heads (ensemble), and include a length penalty in the reward signal. Some labs train RM on "hard negatives"—responses that are technically wrong but score high on surface metrics.
Mode Collapse: The policy converges to generating the same response for all prompts. This occurs if the RM signal is weak or if the KL penalty is too loose. Mitigation: increase entropy regularization (encourage the policy to sample diverse tokens) or reduce KL weight incrementally. Monitor mode collapse by tracking unique tokens and response diversity.
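A cheap diversity monitor is the fraction of unique tokens across a batch of responses. Whitespace splitting here stands in for the model's real tokenizer, and the sample batches are fabricated to show the contrast:

```python
def distinct_token_ratio(responses):
    """Unique-token fraction over a batch of responses.

    A collapsing policy drives this ratio toward 1/batch_size as
    outputs become near-identical; track it alongside reward curves.
    """
    tokens = [tok for resp in responses for tok in resp.split()]
    return len(set(tokens)) / max(1, len(tokens))

healthy = ["the cat sat", "dogs run fast", "rain falls today"]
collapsed = ["i cannot help", "i cannot help", "i cannot help"]
print(distinct_token_ratio(healthy))            # 1.0 -- every token unique
print(round(distinct_token_ratio(collapsed), 2))  # 0.33
```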
Compute Cost: RLHF is expensive. Generating responses (sampling from the policy), scoring them (forward pass through RM), and running PPO require significant compute. Scaling to large models (100B+ parameters) becomes prohibitive. Mitigation: use inference optimization (quantization, KV caching), batch samples across multiple GPUs, or use distributed PPO algorithms.
Annotation Quality: Human annotators have biases, disagreements, and fatigue. Inter-rater agreement is typically 60-70%; some pairs are ambiguous. Low-quality annotations corrupt the RM. Mitigation: use multiple raters per pair and aggregate via voting, train raters on rubrics, and monitor annotation consistency over time. Consider using AI judges (e.g., a strong LLM) to augment human feedback.
RLHF is powerful but not the only path. Alternative alignment approaches have emerged, each with tradeoffs:
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| RLHF | Train RM on preferences; use PPO | Proven at scale; leverages human judgment; indirect learning | Expensive; complex pipeline; RM reward hacking |
| DPO | Optimize policy directly on preference pairs, no RM | Simpler; fewer hyperparameters; faster; stable | Needs larger preference dataset; less explored at very large scale |
| RLAIF | Use AI judges (strong LLM) instead of humans for preference | Scales without human annotation; cheaper; faster iteration | Risk of circular reasoning; AI judge biases propagate |
| Constitutional AI | Finetune on critiques from principle-based LLM judge | Interpretable; aligns to explicit values; low annotation cost | Critique generation can be weak; depends on constitution phrasing |
In practice, many labs use a hybrid: start with RLHF to achieve good performance, then switch to DPO (cheaper, faster) for iterative refinement. RLAIF and Constitutional AI are exciting for scaling—they reduce human annotation burden—but they introduce new risks (e.g., if the AI judge is biased, the bias propagates).
The Hugging Face TRL (Transformers Reinforcement Learning) library is the go-to toolkit for RLHF and variants. It provides SFTTrainer, RewardTrainer, and PPOTrainer that abstract away complexity.
Hyperparameter Tuning: Key hyperparameters include SFT learning rate (1e-4 to 5e-4), RM learning rate (1e-5 to 1e-4), PPO learning rate (1e-6 to 1e-5), KL weight (0.01 to 0.1), and PPO epochs (1-4). Start conservative and scale up. Use a held-out validation set to pick the final checkpoint.
Hardware & Cost: Training a 7B model end-to-end (SFT + RM + PPO) on a single A100 GPU typically takes 2-4 weeks. For larger models (70B+), use multi-GPU or multi-node distributed training. Cloud providers (Lambda Labs, Modal, RunPod) offer GPUs by the hour; budget $1k-10k per full RLHF run depending on model size and dataset.
Successful RLHF deployments share a common set of careful choices that rarely appear in papers. Annotator selection matters enormously: annotators with domain expertise (e.g., medical professionals for medical RLHF) produce reward models that generalize better than crowdworkers, whose preferences are noisier. Budget 20-30% of total annotation cost on annotator calibration: showing agreed-upon examples and measuring inter-annotator agreement before starting the main labeling run.
Reward model architecture should mirror the policy model in scale. A reward model that is two size classes smaller than the policy will underfit to subtle quality differences the policy exploits. Train the reward model with a held-out "golden set" of 200–500 examples judged by senior annotators; use this set as a calibration probe at the end of each RM training run to detect over-fitting before it propagates to PPO.
For PPO, keep the KL penalty coefficient (beta) between 0.01 and 0.1. Lower beta allows more deviation from the SFT policy, which can improve helpfulness but risks reward hacking. Monitor the KL divergence metric directly during training; if it exceeds 20 nats, the policy is drifting too far and should be pulled back. A simple proxy for reward hacking: if the reward model score is rising but human preference ratings on a live sample are flat or declining, the policy has found a shortcut the reward model rewards but humans dislike.
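Both monitors in this paragraph can be sketched in a few lines, assuming you log per-sequence log-probabilities under the policy and the reference model. The threshold, log-probs, and trend values below are illustrative:

```python
def mean_kl_estimate(logps_policy, logps_ref):
    """Batch-averaged sampled-KL estimate in nats.

    For sequences sampled from the policy, log pi(x) - log ref(x)
    averaged over the batch estimates KL(policy || reference); values
    creeping past a chosen budget (the text suggests ~20 nats) flag
    the policy drifting too far from the SFT model.
    """
    diffs = [p - r for p, r in zip(logps_policy, logps_ref)]
    return sum(diffs) / len(diffs)

def hacking_suspected(rm_score_trend, human_rating_trend):
    """Proxy from the text: RM score rising while live human ratings
    are flat or declining suggests the policy found a shortcut."""
    return rm_score_trend > 0 and human_rating_trend <= 0

# Two monitored sequences, well within a 20-nat budget.
print(mean_kl_estimate([-40.0, -38.0], [-44.0, -41.0]))  # 3.5
# RM scores rising (+0.6) while human ratings dip (-0.05): investigate.
print(hacking_suspected(0.6, -0.05))  # True
```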