Alignment & RLHF

ORPO

Odds Ratio Preference Optimisation: trains on chosen responses while penalising rejected ones in a single stage, eliminating the separate SFT step that DPO requires. Simpler pipeline, comparable results.

1-stage training
No SFT pre-step needed
Odds ratio loss term


SECTION 01

ORPO vs DPO vs RLHF

RLHF (full): SFT → train reward model → PPO RL fine-tuning. Gold standard for quality, but it requires a separate reward model, PPO tuning (notoriously unstable), and 3–4 separate training stages. Used by OpenAI and Anthropic for frontier models.

DPO: SFT → DPO on chosen/rejected pairs. Eliminates the reward model and RL training loop. Still requires a separate SFT stage first: DPO needs a good base policy to align. Two-stage pipeline.

ORPO (Hong et al. 2024): single training stage. The ORPO loss combines supervised fine-tuning on chosen responses with a penalty for rejected responses in one unified objective. No SFT pre-step needed: you train directly from a base model to an aligned model. Simpler pipeline, comparable quality to DPO on most benchmarks.

In practice: ORPO is the best default when you have a preference dataset and want the simplest possible pipeline. Use DPO when you already have a well-tuned SFT model you want to align. Use full RLHF only when you have the infrastructure and need frontier-quality alignment.

SECTION 02

The ORPO loss function

ORPO combines two components:

SFT loss: standard negative log-likelihood on chosen responses. Trains the model to generate the preferred outputs.

Odds ratio penalty: penalises the model for having higher relative probability on rejected responses. Defined as:

odds(x) = P(x) / (1 - P(x))
OR = odds(chosen) / odds(rejected)
L_OR = -log(sigmoid(log(OR)))

Total loss: L_ORPO = L_SFT + λ · L_OR

The odds ratio contrasts chosen and rejected more mildly than the probability ratio that DPO optimises, which the ORPO paper argues makes it better behaved when combined with an SFT loss in a single objective, and less sensitive to the absolute probability scale. When the model assigns much higher probability to chosen than rejected, OR is large and L_OR is small: the penalty disappears. When the model prefers rejected, OR < 1 and the penalty is large.
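The sign behaviour described above is easy to check numerically. A minimal sketch using only the standard library; the probability values are illustrative, not measured:

```python
import math

def or_penalty(p_chosen: float, p_rejected: float) -> float:
    # odds(x) = P(x) / (1 - P(x));  L_OR = -log(sigmoid(log(OR)))
    log_or = math.log(p_chosen / (1 - p_chosen)) - math.log(p_rejected / (1 - p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

print(or_penalty(0.8, 0.1))  # model prefers chosen: penalty ~0.03
print(or_penalty(0.3, 0.4))  # model prefers rejected: penalty ~0.94
```

As the model's preference for chosen grows, the penalty decays towards zero, so late in training the SFT term dominates the gradient.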

SECTION 03

Why odds ratio works

import torch

def orpo_loss_explained(log_probs_chosen, log_probs_rejected, lambda_: float = 0.1):
    # log_probs: length-normalised (mean per-token) log-probability of each
    # response; shape: (batch,). Values must be < 0, i.e. probabilities < 1.

    # SFT component: maximise probability of chosen responses
    sft_loss = -log_probs_chosen.mean()

    # Odds ratio component
    # P(chosen) / (1 - P(chosen)) vs P(rejected) / (1 - P(rejected))
    # Working in log-space for numerical stability:

    def log_odds(log_p):
        # log_odds = log_p - log(1 - exp(log_p))
        # log1p keeps the second term stable when exp(log_p) is near 0
        return log_p - torch.log1p(-torch.exp(log_p))

    log_or = log_odds(log_probs_chosen) - log_odds(log_probs_rejected)

    # We want log_or > 0 (model prefers chosen over rejected)
    # Loss is small when model strongly prefers chosen, large otherwise
    or_loss = -torch.nn.functional.logsigmoid(log_or).mean()

    total = sft_loss + lambda_ * or_loss
    return total, sft_loss.item(), or_loss.item()

# Intuition:
# If P(chosen)=0.8, P(rejected)=0.1:
#   odds_chosen = 0.8/0.2 = 4.0
#   odds_rejected = 0.1/0.9 = 0.11
#   OR = 4.0/0.11 = 36 (model strongly prefers chosen โ€” low penalty)
# If P(chosen)=0.3, P(rejected)=0.4:
#   OR = (0.3/0.7)/(0.4/0.6) = 0.43/0.67 = 0.64 (model prefers rejected โ€” high penalty)
SECTION 04

Training with ORPOTrainer

from trl import ORPOTrainer, ORPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from a base model (no SFT pre-training needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # base model, not instruct
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

from datasets import load_dataset
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")
# Dataset has: prompt, chosen, rejected columns

trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(
        output_dir="./orpo-output",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=8e-6,          # similar to DPO: small lr
        beta=0.1,                    # lambda for OR loss weight
        max_length=1024,
        max_prompt_length=512,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
SECTION 05

Dataset format

from datasets import Dataset

# ORPO expects the same format as DPO: prompt, chosen, rejected
# Each field is a string (not a message list โ€” just the text)

orpo_data = Dataset.from_list([
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris, which has served as the country's political and cultural centre since the 10th century.",
        "rejected": "France's capital is I think Paris or maybe Lyon, not totally sure.",
    },
    {
        "prompt": "Write a Python function to check if a number is prime.",
        "chosen": "def is_prime(n: int) -> bool:
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return False
    return True",
        "rejected": "def check_prime(n):
    for i in range(2, n):
        if n % i == 0:
            return False
    return True  # O(n) instead of O(sqrt(n))",
    },
])

# If your data is in chat format (message lists), convert first. Keep only the
# final assistant turn as the chosen/rejected text; otherwise the prompt turns
# are duplicated when the trainer concatenates prompt + response:
def format_for_orpo(example):
    prompt_text = tokenizer.apply_chat_template(
        example["chosen"][:-1], tokenize=False, add_generation_prompt=True
    )
    chosen_text = example["chosen"][-1]["content"]
    rejected_text = example["rejected"][-1]["content"]
    return {"prompt": prompt_text, "chosen": chosen_text, "rejected": rejected_text}
SECTION 06

When to choose ORPO

Choose ORPO when: you want the simplest pipeline (one training job instead of two), you're starting from a base model, you have a preference dataset with clear chosen/rejected pairs, or you're resource-constrained and can't run a two-stage SFT + DPO pipeline.

Choose DPO when: you already have a high-quality SFT model and want to align it without retraining from scratch. DPO fine-tunes only the alignment; ORPO fine-tunes both instruction following and alignment simultaneously. If your SFT model is already very good, DPO may preserve more of that tuning.

Quality comparison: on most benchmarks (MT-Bench, AlpacaEval), ORPO produces comparable or slightly lower quality than SFT+DPO. The difference is small enough that the pipeline simplicity often justifies choosing ORPO. For frontier-quality models, full RLHF with a reward model still produces better results than both.

SECTION 07

Gotchas

ORPO's lambda (beta) needs tuning. Too small (around 0.01): the OR penalty is negligible and you get near-SFT behaviour. Too large (around 1.0): the OR penalty dominates, the model optimises for beating the rejected response rather than for helpfulness, and output quality degrades. Start with beta=0.1 and tune based on win rate on a held-out eval set.
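To see why the extremes misbehave, look at how much of the total loss the OR term contributes at each setting. The loss magnitudes below are illustrative assumptions, not measured values:

```python
# Hypothetical per-batch loss values, chosen only to illustrate scaling
sft_loss, or_loss = 2.0, 0.7

for lam in (0.01, 0.1, 1.0):
    total = sft_loss + lam * or_loss
    share = lam * or_loss / total
    print(f"lambda={lam}: total={total:.2f}, OR share={share:.1%}")
```

At 0.01 the OR term is well under 1% of the total (effectively plain SFT); at 1.0 it is roughly a quarter, large enough to pull the optimiser away from the SFT objective.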

Chosen and rejected must differ meaningfully. If chosen and rejected responses are very similar (minor wording differences), the odds ratio provides almost no training signal. The most effective preference pairs have clear quality differences: correct vs. incorrect, specific vs. vague, safe vs. unsafe. Build datasets where a naive observer can immediately tell which response is better.
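A cheap programmatic pass can catch the worst offenders before training. This is a sketch on made-up rows; the `min_len` threshold is an arbitrary assumption:

```python
def usable_pair(row, min_len: int = 10) -> bool:
    # Reject rows with missing fields, trivially short responses,
    # or identical chosen/rejected (no odds-ratio signal)
    required = ("prompt", "chosen", "rejected")
    if not all(isinstance(row.get(k), str) and row[k].strip() for k in required):
        return False
    if len(row["chosen"]) < min_len or len(row["rejected"]) < min_len:
        return False
    return row["chosen"].strip() != row["rejected"].strip()

rows = [
    {"prompt": "2+2?", "chosen": "2 + 2 equals 4.", "rejected": "It might be 5."},
    {"prompt": "Capital?", "chosen": "Paris is the capital.", "rejected": "Paris is the capital."},
]
print([usable_pair(r) for r in rows])  # [True, False]
```

A filter like this only removes degenerate pairs; it cannot judge whether the quality gap between the two responses is genuinely clear, which still needs human or model review.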

Base model matters more for ORPO than DPO. DPO starts from an SFT model that already knows how to follow instructions. ORPO starts from a base model and must learn instruction following and preference alignment simultaneously. Using a weak base model (e.g., a very small model or a domain-specific model without instruction tuning) produces worse results with ORPO than DPO.

ORPO vs. Other Alignment Methods

Odds Ratio Preference Optimization (ORPO) integrates preference learning directly into the supervised fine-tuning loss, eliminating the need for a separate reward model or reference model. This makes training pipelines simpler, cheaper, and faster while achieving alignment quality competitive with RLHF and DPO.

Method     | Needs Reward Model | Needs Reference Model | Training Phases | Compute Cost
RLHF (PPO) | Yes                | Yes (policy init)     | 3 (SFT+RM+RL)   | High
DPO        | No                 | Yes                   | 2 (SFT+DPO)     | Medium
ORPO       | No                 | No                    | 1 (combined)    | Low
SimPO      | No                 | No                    | 1               | Low

The odds ratio in ORPO measures how much more likely the model is to generate the preferred response than the rejected response. By penalizing the rejected response directly within the SFT loss via a log-odds penalty term, the model simultaneously learns the target distribution and moves away from undesirable outputs, all in a single forward-backward pass without needing to query a reference checkpoint.

ORPO is particularly well-suited for instruction fine-tuning scenarios where you have a curated dataset of preferred and rejected response pairs but lack the infrastructure to run a full RLHF pipeline. The reduced memory footprint (no second reference-model checkpoint in GPU memory) makes it accessible even on single-GPU setups, enabling effective alignment of 7B and 13B models on consumer hardware with QLoRA.

Hyperparameter sensitivity is lower in ORPO than in DPO because there is no beta coefficient controlling the KL divergence penalty against a reference model. The key hyperparameter is the lambda coefficient weighting the odds ratio loss relative to the SFT cross-entropy loss (TRL's ORPOConfig exposes it as beta). Values between 0.1 and 1.0 are reported to work; higher lambda values push the model more strongly away from rejected responses but can cause training instability if the preferred and rejected responses are stylistically similar rather than substantively different.

ORPO data requirements are the same as DPO: each training example needs a prompt, a preferred response, and a rejected response. High-quality preference data matters more than dataset size: a few thousand well-labeled pairs with clear quality distinctions between preferred and rejected responses trains more reliably than tens of thousands of noisy pairs. Synthetic preference data generated by using a stronger model to rank completions from a weaker model is a practical way to bootstrap ORPO training when human-labeled data is scarce.

When evaluating ORPO training convergence, monitor the odds ratio metric alongside the standard training loss. A healthy training run shows the odds ratio steadily increasing: the model becomes progressively more likely to generate preferred responses relative to rejected ones. If the odds ratio plateaus early while the SFT loss continues decreasing, the preference data quality may be the limiting factor rather than the learning rate or model capacity.
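One way to automate that check is a small helper comparing recent and earlier windows of the logged odds-ratio metric. Window size, tolerance, and the sample histories are arbitrary assumptions here:

```python
def plateaued(history, window: int = 5, tol: float = 1e-3) -> bool:
    # True if the mean of the last `window` values is within `tol`
    # of the mean of the preceding `window` values
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    earlier = sum(history[-2 * window:-window]) / window
    return abs(recent - earlier) < tol

rising = [0.1 * i for i in range(10)]      # still improving
flat = [0.63] * 10                         # stalled
print(plateaued(rising), plateaued(flat))  # False True
```

Run against the per-step log-odds-ratio values from your training logs; combine a plateau signal here with a still-falling SFT loss to flag the data-quality situation described above.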