SAFETY & ROBUSTNESS

LLM Safety Techniques

Constitutional AI, red-teaming, guardrails, and jailbreak defenses — the technical toolkit for safe deployment

prevent · detect · mitigate the three pillars
jailbreaks, bias, hallucination what can go wrong
train-time + runtime two intervention layers
Contents
  1. The safety problem space
  2. Constitutional AI & RLAIF
  3. Red-teaming & evaluation
  4. Runtime guardrails
  5. Jailbreak patterns & defenses
  6. Privacy & memorization
  7. Safety evaluation & red lines
01 — Overview

The Safety Problem Space

LLM safety is not one problem — it's a cluster of distinct failure modes, each requiring different defenses. Conflating them obscures solutions. A safety system must address harmful outputs, jailbreaks, hallucination, bias, privacy leakage, and prompt injection as separate but complementary concerns.

Safety Failure Taxonomy

Harmful outputs: Instructions for violence, CSAM, bioweapons — direct policy violations. Jailbreaks: Prompt injections that bypass safety training through creative phrasing or role-play. Hallucination: Confident wrong answers, usually factual but sometimes harmful. Bias and fairness: Systematic disparate outputs across demographic groups. Privacy leakage: Memorized PII or training data extracted via prompting. Prompt injection: Malicious content in retrieved context that hijacks the model.

Failure typeWho is harmedMain defenseHard to fully solve?
Harmful outputsThird partiesAlignment trainingYes
JailbreaksUsers / societyRed-teaming + filtersYes
HallucinationUsersGrounding, evalsYes
BiasAffected groupsData curation, red-teamYes
Privacy leakageIndividualsDifferential privacy, filteringPartial
Prompt injectionDeployersInput sanitization, separationYes
💡 Key insight: Defense-in-depth is mandatory. No single technique solves safety. Combine alignment training, red-teaming, guardrails, input/output filtering, and monitoring.
02 — Alignment

Constitutional AI and RLAIF

Constitutional AI (CAI) is Anthropic's approach: instead of human preference labels, use a list of principles (the "constitution") to generate synthetic feedback. A capable model critiques and revises its own outputs using the constitution. The revised outputs become training data for alignment.

RLAIF (RL from AI Feedback) trains a reward model on AI-generated preference data instead of human-labeled data. This scales without human annotation budgets. The same constitution guides both critique and reward modeling.

CAI Process

1

Constitution — principles guide training

Write explicit principles that define safe/helpful behavior. Examples: "Be helpful. Be honest. Minimize harm." The constitution is your values codified.

  • Clear, specific principles (vague guidelines → vague feedback)
  • Balance competing values (honesty vs. helpfulness)
  • Iterate on principles as model behavior reveals gaps
2

Critique — model self-evaluates

Ask a strong model to critique its own outputs using the constitution. The model identifies flaws and explains them in the constitution's terms.

  • Prompt: "Critique this response against principle X..."
  • Model generates detailed critique or passes if output is acceptable
  • No human annotation needed
3

Revision — model self-improves

The model revises outputs to address critiques. The revised version is now better-aligned with the constitution.

  • Prompt: "Revise your response to address the critique..."
  • Model generates improved version
  • Results become training data (revised output is "preferred")
4

Fine-Tune — supervised learning on revisions

SFT on the revised (better) outputs. This directly trains the model on aligned behavior without reward models or PPO complexity.

  • Dataset: (prompt, revised_response) pairs
  • Simple cross-entropy loss
  • Repeat: critique/revise cycle → more training data
Constitutional AI critique prompt pattern: System: "Consider whether the following response is harmful, deceptive, or dishonest according to principle X..." [assistant response] Critique: "The response [identifies issue]..." Revision: "[Improved response that addresses the critique]"
CAI advantages: Scalable (no human labelers), consistent (same constitution everywhere), auditable (constitution is readable). Claude's values come from its constitution, not just human labels.
⚠️ CAI limitations: Works best for style and behavior alignment. For factual correctness or domain expertise, human feedback is still necessary. You cannot critique what you don't know.
03 — Testing

Red-Teaming and Adversarial Evaluation

Red-teaming means systematically attempting to elicit harmful outputs before deployment. You attack your own system to find vulnerabilities, then patch them. Red-teaming has two forms: manual (human experts) and automated (LLM-based or gradient-based).

Red-Teaming Approaches

1

Manual Red-Teaming — expert judgment

Human security experts and domain specialists manually probe the model for failure modes in target categories.

  • High creativity, finds novel attacks
  • Expensive and limited in scale
  • Requires security expertise
  • Best for critical systems
Python · Automated red-teaming with Garak (open-source LLM security scanner)
# Install: pip install garak
# Garak tests LLMs for: jailbreaks, prompt injection, toxicity,
# hallucination, data leakage, and more.

# Command-line usage (simplest):
# python -m garak --model_type openai --model_name gpt-4o #   --probes dan.Dan_11_0,encoding,continuation

# Programmatic usage:
import garak.cli

def run_security_scan(model_name: str, probe_categories: list[str]) -> dict:
    """Run a subset of Garak probes against an OpenAI model."""
    # Build probe list
    probes = ",".join(probe_categories)

    # Run scan (writes results to ./garak_runs/)
    import subprocess
    result = subprocess.run([
        "python", "-m", "garak",
        "--model_type", "openai",
        "--model_name", model_name,
        "--probes", probes,
        "--report_prefix", f"scan_{model_name.replace('/', '_')}"
    ], capture_output=True, text=True)

    return {
        "model": model_name,
        "probes": probe_categories,
        "stdout": result.stdout[-2000:],
        "returncode": result.returncode
    }

# Common probe categories for a standard security baseline:
scan = run_security_scan(
    model_name="gpt-4o",
    probe_categories=[
        "dan.Dan_11_0",      # jailbreak: Do Anything Now variant
        "encoding",          # encoding-based bypasses (base64, ROT13)
        "continuation",      # harmful text continuation
        "promptinject",      # prompt injection attacks
        "leakreplay",        # training data memorization
    ]
)
print(f"Scan complete. Exit: {scan['returncode']}")
2

PAIR (Prompt Automatic Iterative Refinement) — LLM-based

A separate attacker LLM iteratively refines jailbreak prompts. Each iteration, the attacker sees the target model's response and generates a better attack.

  • Highly scalable (automated)
  • Good at semantic/creative jailbreaks
  • Moderate transferability across models
  • Medium compute cost
3

GCG (Greedy Coordinate Gradient) — gradient-based

Optimize adversarial suffix at the token level via gradient descent. Transfer well across models. Finds token-level exploits that semantic methods miss.

  • High transferability (same suffix works on different models)
  • Finds exploits that semantic attacks miss
  • Medium compute cost
  • Less human-interpretable attacks
4

Fuzzing — random perturbations

Generate random variations of inputs and check for failures. Low sophistication but catches edge cases that structured tests miss.

  • Very high scale (generate millions of variants)
  • Finds edge cases and unexpected interactions
  • Low transferability
  • High false-positive rate
MethodScaleTransferabilityFindsEffort
Manual expertLowN/ACreative, novel attacksHigh
Automated LLM (PAIR)HighMediumSemantic jailbreaksLow
Gradient-based (GCG)MediumHigh (transfers)Token-level exploitsMedium
FuzzingVery highLowEdge casesLow
💡 Best practice: Combine methods. Manual red-teaming finds creative attacks; PAIR scales semantic attacks; GCG finds transferable exploits; fuzzing catches edge cases. Run all in parallel during development.
04 — Deployment

Runtime Guardrails

Train-time alignment is necessary but insufficient. Runtime filters provide defense in depth — a safety net that catches violations that slipped through training. Input guardrails catch attacks before they reach the model; output guardrails catch policy violations before they reach users.

Guardrail Methods

1

Regex & Keyword Filters — fast, brittle

Pattern match against known bad inputs/outputs. Zero latency. Easily bypassed by paraphrasing.

  • Good for: Known patterns (credit card numbers, known slurs)
  • Pros: Fast, deterministic, no model calls
  • Cons: Brittle, high false-positive rate, easy to bypass
  • Use for: First-pass filtering, not primary defense
2

ML Classifiers — learned patterns

Train a small BERT-class model to classify harmful intent. Good precision on known categories. Misses novel attacks.

  • Good for: Toxicity, hate speech, known harm categories
  • Pros: Generalizes beyond exact keywords
  • Cons: Limited to training data; novel attacks slip through
  • Use for: Reliable filtering on known harm types
3

LLM-as-Judge — semantic understanding

Use a separate (smaller) LLM to evaluate I/O pairs. High quality understanding. Adds latency + cost.

  • Good for: Complex policy violations, context-aware harms
  • Pros: Semantic understanding, handles novel cases
  • Cons: Adds 500ms+ latency, increases cost
  • Use for: Critical systems where latency is acceptable
4

Prompt Injection Detection — special case

Detect instructions embedded in user content or retrieved context attempting to override system prompt.

  • Techniques: Semantic similarity to system prompt keywords, instruction phrases
  • Tools: Rebuff, Guardrails AI (DAN detection)
  • Challenge: True positives hard to distinguish from legitimate content

Guardrail Tools

Framework
LlamaGuard
Meta's safety classifier for input/output toxicity.
Framework
NeMo Guardrails
NVIDIA's guardrail framework with modular rules.
Framework
Guardrails AI
Python framework for building structured guardrails.
Specialization
Rebuff
Prompt injection detection and mitigation.
API
Azure Content Safety
Managed content classification API.
API
Perspective API
Google's toxicity and bias classifier.
05 — Attacks

Jailbreak Patterns and Defenses

Jailbreaks are creative prompts that bypass safety training. Understanding common patterns helps you defend against them. No single defense is complete — you must defend against multiple categories simultaneously.

Major Jailbreak Categories

Role-play attacks: "Pretend you are DAN (Do Anything Now)" — the model assumes a persona that ignores safety rules. Indirect injection: Hide instructions in documents the model is asked to summarize. Many-shot jailbreaking: Fill context with examples of the model complying with harmful requests. Multilingual bypass: Request harmful content in low-resource languages where safety training is weaker. Encoding bypass: ROT13, Base64, or pig Latin to evade keyword filters.

Jailbreak typeExampleDefenseEffectiveness
Role-play"pretend you are DAN"Training robustnessHigh
Indirect injectionMalicious text in retrieved docSeparation of system/user/contextMedium
Many-shot100 examples of complianceConstitutional trainingMedium
EncodingBase64 harmful requestDecode before filteringHigh
MultilingualRequest in SwahiliMultilingual safety trainingMedium
⚠️ Critical: No single defense is complete. Defense-in-depth is the only robust approach. Train-time alignment + runtime guardrails + monitoring + incident response.
06 — Data

Privacy and Memorization

LLMs memorize training data. Verbatim reproduction of training text is measurable and extractable. Extraction attacks prompt the model to repeat memorized content. Defenses reduce memorization but don't eliminate it.

Extraction Attacks

Prefix attack: Prompt: "Repeat the following text:" + random prefix. If the model completes with memorized content, the attack succeeds. Membership inference: Can an adversary determine if a specific text was in the training set? Yes, at rates above random using perplexity measurements.

Memorization Defenses

Deduplication: Remove duplicate training examples before training. Simple and effective — reduces memorization by 10× (Lee et al. 2022). Differential privacy: Add noise during training. More principled but lower impact at scale. Output filtering: Detect and filter PII patterns in model outputs. Catches obvious leakage but misses creative extractions.

Memorization measurement: Exact memorization rate (at scale): - Large models: 1-5% of training set reproduced verbatim - Smaller models: 0.1-1% - Deduplication reduces this by ~10× Membership inference (probabilistic): - Perplexity difference between in-distribution and out - Not 100% accurate but above random
⚠️ Practical insight: Deduplication is simpler and more effective than differential privacy at realistic scales. Deduplicate your training data ruthlessly. Differential privacy adds significant training overhead for modest gains.
07 — Quality

Safety Evaluation and Red Lines

Safety evaluation requires defining what "safe" means. Absolute red lines (universal violations) are rare. Most policies exist on a spectrum with precision/recall tradeoffs. A model that refuses benign medical questions harms users as much as one that gives harmful advice.

Evaluation Benchmarks

HELM (Holistic Evaluation of Language Models): Includes toxicity, bias, and disinformation benchmarks. Evaluates real use cases with safety constraints. BBQ (Bias Benchmark for QA): Measures social bias in question answering across demographic groups. ToxiGen: Tests for implicit toxicity and bias against minority groups — harder than explicit toxicity.

Policy Framework

Define three levels: Red lines (universal): CSAM, detailed synthesis instructions for mass-casualty weapons, targeted harassment campaigns. High precision: Requests for harmful instructions; require high confidence before refusal. Gray area: Controversial topics; balance informativeness vs. harm. Track refusal rates by category — over-refusal is also a failure.

Key insight: Over-refusal is a safety failure. If your model refuses benign medical, legal, or educational questions to avoid false positives, you're harming users. Measure refusal rates as a safety metric.

Safety Evaluation Tools

Benchmark
HELM
Holistic evaluation including safety constraints.
Benchmark
BBQ
Bias benchmark for QA systems.
Benchmark
ToxiGen
Implicit toxicity and bias evaluation.
Framework
Safety Taxonomy
Custom red lines and policy definitions.
08 — Further Reading

References

Academic Papers
Frameworks & Guides