HuggingFace's Transformer Reinforcement Learning library — SFTTrainer for instruction tuning, DPOTrainer for preference alignment, and PPOTrainer for RLHF, all with minimal boilerplate and native PEFT integration.
TRL (Transformer Reinforcement Learning) is HuggingFace's library for fine-tuning LLMs beyond standard next-token prediction. It wraps the HuggingFace Trainer with LLM-specific additions: chat template handling, completions-only loss masking, preference dataset formats, and RL training loops.
Core trainers: SFTTrainer (Supervised Fine-Tuning — instruction following), DPOTrainer (Direct Preference Optimisation — align to preferred responses without RL), ORPOTrainer (single-stage SFT + preference), PPOTrainer (full RLHF with a reward model — rarely needed now that DPO exists), and RewardTrainer (train a reward model for RLHF).
For most fine-tuning tasks: use SFTTrainer for instruction following, DPOTrainer when you have chosen/rejected pairs. PPOTrainer is complex to tune and only justified when you have a reliable reward model and need iterative online training.
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Dataset in conversational format
data = Dataset.from_list([
    {"messages": [
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Write a function to merge two sorted lists."},
        {"role": "assistant", "content": (
            "def merge_sorted(a, b):\n"
            "    result = []\n"
            "    i = j = 0\n"
            "    while i < len(a) and j < len(b):\n"
            "        if a[i] <= b[j]: result.append(a[i]); i += 1\n"
            "        else: result.append(b[j]); j += 1\n"
            "    return result + a[i:] + b[j:]"
        )},
    ]},
])
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    args=SFTConfig(
        output_dir="./sft-output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        bf16=True,
        max_seq_length=2048,
        # A "messages" column marks the dataset as conversational;
        # SFTTrainer applies the tokenizer's chat template automatically
        # (dataset_text_field is only for plain-text columns)
    ),
)
trainer.train()
SFTTrainer applies the model's chat template to convert message lists into training strings. This ensures the model learns to produce outputs in the exact format expected at inference (system prompts, user/assistant tags, EOS tokens).
# Preview what the chat template produces
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
# Qwen2.5 output (ChatML format):
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is 2+2?<|im_end|>
# <|im_start|>assistant
# 4<|im_end|>
# completions_only_loss: only compute loss on assistant turns (not on prompt)
# This is critical — without it, the model trains to reproduce the system/user turns too
from trl import DataCollatorForCompletionOnlyLM
# Find the response template token IDs
response_template = "<|im_start|>assistant\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
# Pass to SFTTrainer as data_collator= argument
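Wiring it up, a minimal sketch reusing the model, tokenizer, and data objects from the SFT example above (the config values here are illustrative, not prescriptive):

```python
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    data_collator=collator,  # loss now computed only on assistant tokens
    args=SFTConfig(output_dir="./sft-output", bf16=True),
)
```

Note that completion-only collation is generally incompatible with sequence packing, so leave packing off when using this collator.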
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
# DPO dataset: each example has prompt, chosen response, rejected response
dpo_data = Dataset.from_list([
    {
        "prompt": "Explain quantum entanglement.",
        "chosen": "Quantum entanglement is a phenomenon where two particles become correlated so that measuring one instantly determines the state of the other, regardless of distance. Einstein called this 'spooky action at a distance'.",
        "rejected": "It's when particles are connected somehow. Very complicated quantum stuff."
    },
])
# DPO fine-tunes on chosen while penalising rejected — no reward model needed
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None = use the initial model as reference (auto-handled)
    args=DPOConfig(
        output_dir="./dpo-output",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=5e-7,  # DPO needs a MUCH smaller lr than SFT
        beta=0.1,            # KL penalty — higher = stay closer to reference
        bf16=True,
    ),
    train_dataset=dpo_data,
    tokenizer=tokenizer,
)
trainer.train()
DPO requires a model already instruction-tuned (via SFT). Don't run DPO on a raw base model — the reference point needs to be a helpful model, not a next-token predictor.
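In practice this means a two-stage pipeline. A sketch, assuming the SFT run saves its final model to a hypothetical `./sft-output/final` path and reusing the objects defined above:

```python
# Stage 1: supervised fine-tuning
sft_trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=data,
                         args=SFTConfig(output_dir="./sft-output"))
sft_trainer.train()
sft_trainer.save_model("./sft-output/final")

# Stage 2: reload the instruction-tuned checkpoint as the DPO starting point
sft_model = AutoModelForCausalLM.from_pretrained("./sft-output/final")
dpo_trainer = DPOTrainer(model=sft_model, ref_model=None, tokenizer=tokenizer,
                         train_dataset=dpo_data,
                         args=DPOConfig(output_dir="./dpo-output", learning_rate=5e-7))
dpo_trainer.train()
```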
import os
os.environ["WANDB_PROJECT"] = "llm-finetuning"
from trl import SFTConfig
# W&B integration is automatic when wandb is installed
# Just add report_to="wandb" to training args
args = SFTConfig(
    output_dir="./output",
    report_to="wandb",          # or "tensorboard", "none"
    run_name="llama3-sft-v1",   # experiment name in W&B
    logging_steps=10,           # log every 10 steps
    eval_steps=100,
    eval_strategy="steps",
    save_steps=500,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,
)
# Key metrics to watch in W&B:
# - train/loss: should decrease smoothly; spikes indicate instability
# - eval/loss: track gap with train/loss — growing gap = overfitting
# - train/grad_norm: should stay 0.1-10; >100 = exploding gradients
# - train/learning_rate: verify warmup and decay schedule
# - GPU memory usage: check via nvidia-smi or W&B system metrics
# Save model checkpoints and push to Hub
args_with_hub = SFTConfig(
    output_dir="./output",
    push_to_hub=True,
    hub_model_id="your-username/your-model-name",
    hub_strategy="checkpoint",  # push the latest checkpoint (use "all_checkpoints" to keep every one)
)
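Once pushed, the model loads back like any Hub model (the repo id below is the placeholder from the config above):

```python
model = AutoModelForCausalLM.from_pretrained("your-username/your-model-name")
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name")
```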
from trl import SFTConfig
# Essential memory optimisation flags
args = SFTConfig(
    output_dir="./output",
    # Gradient checkpointing: recompute activations on the backward pass
    # ~60% less activation memory, ~20% slower training
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # avoids warnings
    # Mixed precision: compute in bf16/fp16, accumulate gradients in fp32
    bf16=True,   # preferred on Ampere+ (A100, H100, RTX 30xx+)
    # fp16=True, # use for older GPUs (V100, T4)
    # Gradient accumulation: simulate a larger batch without more VRAM
    # effective_batch = per_device_batch * gradient_accumulation * num_gpus
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch = 32
    # Flash Attention 2 (requires the flash-attn package):
    #   pip install flash-attn --no-build-isolation
    #   model = AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
    # 8-bit AdamW (bitsandbytes) — halves optimiser state memory
    # Useful when adapters are large (high rank)
    optim="paged_adamw_8bit",
    # Max sequence length — directly impacts activation memory
    max_seq_length=2048,  # reduce if OOM; most instruction tuning works at 2048
)
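The effective-batch arithmetic in the comments above is worth making explicit. A tiny helper (hypothetical, not part of TRL) that mirrors the formula:

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Batch size per optimiser step = per-device batch x accumulation x GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus

# Matches the config above: 2 per device x 16 accumulation steps on 1 GPU
print(effective_batch_size(2, 16))    # 32
print(effective_batch_size(4, 8, 2))  # 64
```

Gradients are identical to a true large batch (up to batch-norm-style statistics, which causal LMs don't use), so accumulation trades wall-clock time for memory, not quality.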
DPO needs a much smaller learning rate than SFT. SFT uses 1e-5 to 2e-4. DPO typically needs 5e-7 to 5e-6 — about 100× smaller. Using SFT learning rates for DPO causes catastrophic forgetting: the model completely loses its instruction-following ability while "aligning" to the preference data.
Pack sequences for faster training. By default, each training example is padded to max_seq_length, wasting compute on padding tokens. Set packing=True in SFTConfig to pack multiple short examples into each sequence up to max_seq_length. This can 2–4× your training throughput for datasets with short examples.
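A minimal config sketch (values illustrative):

```python
args = SFTConfig(
    output_dir="./output",
    packing=True,         # concatenate short examples up to max_seq_length
    max_seq_length=2048,
)
```

Packing is generally incompatible with the completion-only collator, so choose one or the other depending on whether throughput or precise loss masking matters more for your dataset.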
DPO with ref_model=None saves memory but requires care. When ref_model=None, TRL uses the training model's initial state as the reference. This works but the reference gets stale as the model trains. For high-quality alignment, pass a frozen copy of the SFT model as ref_model — at the cost of 2× memory.
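A sketch of the frozen-reference variant, assuming a hypothetical `./sft-output/final` checkpoint from the SFT stage and the dpo_data defined earlier:

```python
# Load a second, frozen copy of the SFT model as the fixed reference
ref = AutoModelForCausalLM.from_pretrained("./sft-output/final")
trainer = DPOTrainer(
    model=model,
    ref_model=ref,  # stays fixed for the whole run, at ~2x model memory
    args=DPOConfig(output_dir="./dpo-output", learning_rate=5e-7, beta=0.1),
    train_dataset=dpo_data,
    tokenizer=tokenizer,
)
```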
| Trainer | Use Case | Data Format | Key Config |
|---|---|---|---|
| SFTTrainer | Supervised instruction tuning | {"prompt": ..., "completion": ...} | max_seq_length, packing |
| DPOTrainer | Preference alignment without reward model | {"prompt", "chosen", "rejected"} | beta (KL penalty), loss_type |
| PPOTrainer | RL with reward model | {"query": ...} + reward function | kl_penalty, cliprange |
| ORPOTrainer | Odds-ratio preference (no ref model) | {"prompt", "chosen", "rejected"} | beta (the ORPO λ weight) |
| RewardTrainer | Train scalar reward model | {"input_ids_chosen", "input_ids_rejected"} | max_length, num_labels |
TRL integrates tightly with HuggingFace PEFT, so you can combine any trainer with LoRA by passing a peft_config (e.g. a LoraConfig) directly to the trainer, or by wrapping the base model with get_peft_model before passing it in. For memory-constrained setups, set gradient_checkpointing=True and use bf16 precision. On a single A100 80GB GPU, SFTTrainer with a 7B model, LoRA rank=16, and sequence length 2048 typically achieves 2,000-3,000 tokens per second throughput.
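A sketch of the peft_config route, assuming peft is installed and reusing the model, tokenizer, and data from the SFT example (target_modules below are typical attention projections, adjust for your architecture):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    peft_config=peft_config,  # TRL wraps the model with the adapter internally
    args=SFTConfig(output_dir="./sft-lora", bf16=True,
                   gradient_checkpointing=True),
)
```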
Monitor training health with three key metrics: training loss (should decrease steadily), gradient norm (spikes indicate learning rate too high), and evaluation loss on a held-out set (rising divergence means overfitting). TRL logs all three to Weights and Biases automatically when wandb is installed. Set eval_steps to run evaluation every 10-20% of training to catch overfitting before it becomes severe.
TRL (Transformer Reinforcement Learning) has become the standard library for RLHF-based fine-tuning because it abstracts away the complex reward model training loop, PPO clipping, KL divergence penalties, and reference model management that are otherwise error-prone to implement from scratch. The library integrates seamlessly with HuggingFace Hub, making it easy to push checkpoints during long training runs.
When choosing between SFT and PPO in TRL, consider that SFT is deterministic and reproducible while PPO introduces variance through the reward signal. For production use cases requiring consistent quality, SFT with a well-curated dataset often outperforms RLHF on narrow tasks. PPO shines for open-ended generation tasks where quality is hard to define in advance but easy to judge after the fact.