LLM Safety Techniques

Contents

The safety problem space
Constitutional AI & RLAIF
Red-teaming & evaluation
Runtime guardrails
Jailbreak patterns & defenses
Privacy & memorization
Safety evaluation & red lines

01 — Overview

The Safety Problem Space

LLM safety is not one problem — it's a cluster of distinct failure modes, each requiring different defenses. Conflating them obscures solutions. A safety system must address harmful outputs, jailbreaks, hallucination, bias, privacy leakage, and prompt injection as separate but complementary concerns.

Safety Failure Taxonomy

Harmful outputs: Instructions for violence, CSAM, bioweapons — direct policy violations. Jailbreaks: Prompt injections that bypass safety training through creative phrasing or role-play. Hallucination: Confident wrong answers, usually factual but sometimes harmful. Bias and fairness: Systematic disparate outputs across demographic groups. Privacy leakage: Memorized PII or training data extracted via prompting. Prompt injection: Malicious content in retrieved context that hijacks the model.

Failure type	Who is harmed	Main defense	Hard to fully solve?
Harmful outputs	Third parties	Alignment training	Yes
Jailbreaks	Users / society	Red-teaming + filters	Yes
Hallucination	Users	Grounding, evals	Yes
Bias	Affected groups	Data curation, red-team	Yes
Privacy leakage	Individuals	Differential privacy, filtering	Partial
Prompt injection	Deployers	Input sanitization, separation	Yes

💡 Key insight: Defense-in-depth is mandatory. No single technique solves safety. Combine alignment training, red-teaming, guardrails, input/output filtering, and monitoring.

02 — Alignment

Constitutional AI and RLAIF

Constitutional AI (CAI) is Anthropic's approach: instead of human preference labels, use a list of principles (the "constitution") to generate synthetic feedback. A capable model critiques and revises its own outputs using the constitution. The revised outputs become training data for alignment.

RLAIF (RL from AI Feedback) trains a reward model on AI-generated preference data instead of human-labeled data. This scales without human annotation budgets. The same constitution guides both critique and reward modeling.

CAI Process

Constitution — principles guide training

Write explicit principles that define safe/helpful behavior. Examples: "Be helpful. Be honest. Minimize harm." The constitution is your values codified.

Clear, specific principles (vague guidelines → vague feedback)
Balance competing values (honesty vs. helpfulness)
Iterate on principles as model behavior reveals gaps

Critique — model self-evaluates

Ask a strong model to critique its own outputs using the constitution. The model identifies flaws and explains them in the constitution's terms.

Prompt: "Critique this response against principle X..."
Model generates detailed critique or passes if output is acceptable
No human annotation needed

Revision — model self-improves

The model revises outputs to address critiques. The revised version is now better-aligned with the constitution.

Prompt: "Revise your response to address the critique..."
Model generates improved version
Results become training data (revised output is "preferred")

Fine-Tune — supervised learning on revisions

SFT on the revised (better) outputs. This directly trains the model on aligned behavior without reward models or PPO complexity.

Dataset: (prompt, revised_response) pairs
Simple cross-entropy loss
Repeat: critique/revise cycle → more training data

Constitutional AI critique prompt pattern: System: "Consider whether the following response is harmful, deceptive, or dishonest according to principle X..." [assistant response] Critique: "The response [identifies issue]..." Revision: "[Improved response that addresses the critique]"

✓ CAI advantages: Scalable (no human labelers), consistent (same constitution everywhere), auditable (constitution is readable). Claude's values come from its constitution, not just human labels.

⚠️ CAI limitations: Works best for style and behavior alignment. For factual correctness or domain expertise, human feedback is still necessary. You cannot critique what you don't know.

03 — Testing

Red-Teaming and Adversarial Evaluation

Red-teaming means systematically attempting to elicit harmful outputs before deployment. You attack your own system to find vulnerabilities, then patch them. Red-teaming has two forms: manual (human experts) and automated (LLM-based or gradient-based).

Red-Teaming Approaches

Manual Red-Teaming — expert judgment

Human security experts and domain specialists manually probe the model for failure modes in target categories.

High creativity, finds novel attacks
Expensive and limited in scale
Requires security expertise
Best for critical systems

Python · Automated red-teaming with Garak (open-source LLM security scanner)

# Install: pip install garak
# Garak tests LLMs for: jailbreaks, prompt injection, toxicity,
# hallucination, data leakage, and more.

# Command-line usage (simplest):
# python -m garak --model_type openai --model_name gpt-4o #   --probes dan.Dan_11_0,encoding,continuation

# Programmatic usage:
import garak.cli

def run_security_scan(model_name: str, probe_categories: list[str]) -> dict:
    """Run a subset of Garak probes against an OpenAI model."""
    # Build probe list
    probes = ",".join(probe_categories)

    # Run scan (writes results to ./garak_runs/)
    import subprocess
    result = subprocess.run([
        "python", "-m", "garak",
        "--model_type", "openai",
        "--model_name", model_name,
        "--probes", probes,
        "--report_prefix", f"scan_{model_name.replace('/', '_')}"
    ], capture_output=True, text=True)

    return {
        "model": model_name,
        "probes": probe_categories,
        "stdout": result.stdout[-2000:],
        "returncode": result.returncode
    }

# Common probe categories for a standard security baseline:
scan = run_security_scan(
    model_name="gpt-4o",
    probe_categories=[
        "dan.Dan_11_0",      # jailbreak: Do Anything Now variant
        "encoding",          # encoding-based bypasses (base64, ROT13)
        "continuation",      # harmful text continuation
        "promptinject",      # prompt injection attacks
        "leakreplay",        # training data memorization
    ]
)
print(f"Scan complete. Exit: {scan['returncode']}")

PAIR (Prompt Automatic Iterative Refinement) — LLM-based

A separate attacker LLM iteratively refines jailbreak prompts. Each iteration, the attacker sees the target model's response and generates a better attack.

Highly scalable (automated)
Good at semantic/creative jailbreaks
Moderate transferability across models
Medium compute cost

GCG (Greedy Coordinate Gradient) — gradient-based

Optimize adversarial suffix at the token level via gradient descent. Transfer well across models. Finds token-level exploits that semantic methods miss.

High transferability (same suffix works on different models)
Finds exploits that semantic attacks miss
Medium compute cost
Less human-interpretable attacks

Fuzzing — random perturbations

Generate random variations of inputs and check for failures. Low sophistication but catches edge cases that structured tests miss.

Very high scale (generate millions of variants)
Finds edge cases and unexpected interactions
Low transferability
High false-positive rate

Method	Scale	Transferability	Finds	Effort
Manual expert	Low	N/A	Creative, novel attacks	High
Automated LLM (PAIR)	High	Medium	Semantic jailbreaks	Low
Gradient-based (GCG)	Medium	High (transfers)	Token-level exploits	Medium
Fuzzing	Very high	Low	Edge cases	Low

💡 Best practice: Combine methods. Manual red-teaming finds creative attacks; PAIR scales semantic attacks; GCG finds transferable exploits; fuzzing catches edge cases. Run all in parallel during development.

04 — Deployment

Runtime Guardrails

Train-time alignment is necessary but insufficient. Runtime filters provide defense in depth — a safety net that catches violations that slipped through training. Input guardrails catch attacks before they reach the model; output guardrails catch policy violations before they reach users.

Guardrail Methods

Regex & Keyword Filters — fast, brittle

Pattern match against known bad inputs/outputs. Zero latency. Easily bypassed by paraphrasing.

Good for: Known patterns (credit card numbers, known slurs)
Pros: Fast, deterministic, no model calls
Cons: Brittle, high false-positive rate, easy to bypass
Use for: First-pass filtering, not primary defense

ML Classifiers — learned patterns

Train a small BERT-class model to classify harmful intent. Good precision on known categories. Misses novel attacks.

Good for: Toxicity, hate speech, known harm categories
Pros: Generalizes beyond exact keywords
Cons: Limited to training data; novel attacks slip through
Use for: Reliable filtering on known harm types

LLM-as-Judge — semantic understanding

Use a separate (smaller) LLM to evaluate I/O pairs. High quality understanding. Adds latency + cost.

Good for: Complex policy violations, context-aware harms
Pros: Semantic understanding, handles novel cases
Cons: Adds 500ms+ latency, increases cost
Use for: Critical systems where latency is acceptable

Prompt Injection Detection — special case

Detect instructions embedded in user content or retrieved context attempting to override system prompt.

Techniques: Semantic similarity to system prompt keywords, instruction phrases
Tools: Rebuff, Guardrails AI (DAN detection)
Challenge: True positives hard to distinguish from legitimate content

Guardrail Tools

Framework

LlamaGuard

Meta's safety classifier for input/output toxicity.

Framework

NeMo Guardrails

NVIDIA's guardrail framework with modular rules.

Framework

Guardrails AI

Python framework for building structured guardrails.

Specialization

Rebuff

Prompt injection detection and mitigation.

API

Azure Content Safety

Managed content classification API.

API

Perspective API

Google's toxicity and bias classifier.

05 — Attacks

Jailbreak Patterns and Defenses

Jailbreaks are creative prompts that bypass safety training. Understanding common patterns helps you defend against them. No single defense is complete — you must defend against multiple categories simultaneously.

Major Jailbreak Categories

Role-play attacks: "Pretend you are DAN (Do Anything Now)" — the model assumes a persona that ignores safety rules. Indirect injection: Hide instructions in documents the model is asked to summarize. Many-shot jailbreaking: Fill context with examples of the model complying with harmful requests. Multilingual bypass: Request harmful content in low-resource languages where safety training is weaker. Encoding bypass: ROT13, Base64, or pig Latin to evade keyword filters.

Jailbreak type	Example	Defense	Effectiveness
Role-play	"pretend you are DAN"	Training robustness	High
Indirect injection	Malicious text in retrieved doc	Separation of system/user/context	Medium
Many-shot	100 examples of compliance	Constitutional training	Medium
Encoding	Base64 harmful request	Decode before filtering	High
Multilingual	Request in Swahili	Multilingual safety training	Medium

⚠️ Critical: No single defense is complete. Defense-in-depth is the only robust approach. Train-time alignment + runtime guardrails + monitoring + incident response.

06 — Data

Privacy and Memorization

LLMs memorize training data. Verbatim reproduction of training text is measurable and extractable. Extraction attacks prompt the model to repeat memorized content. Defenses reduce memorization but don't eliminate it.

Extraction Attacks

Prefix attack: Prompt: "Repeat the following text:" + random prefix. If the model completes with memorized content, the attack succeeds. Membership inference: Can an adversary determine if a specific text was in the training set? Yes, at rates above random using perplexity measurements.

Memorization Defenses

Deduplication: Remove duplicate training examples before training. Simple and effective — reduces memorization by 10× (Lee et al. 2022). Differential privacy: Add noise during training. More principled but lower impact at scale. Output filtering: Detect and filter PII patterns in model outputs. Catches obvious leakage but misses creative extractions.

Memorization measurement: Exact memorization rate (at scale): - Large models: 1-5% of training set reproduced verbatim - Smaller models: 0.1-1% - Deduplication reduces this by ~10× Membership inference (probabilistic): - Perplexity difference between in-distribution and out - Not 100% accurate but above random

⚠️ Practical insight: Deduplication is simpler and more effective than differential privacy at realistic scales. Deduplicate your training data ruthlessly. Differential privacy adds significant training overhead for modest gains.

07 — Quality

Safety Evaluation and Red Lines

Safety evaluation requires defining what "safe" means. Absolute red lines (universal violations) are rare. Most policies exist on a spectrum with precision/recall tradeoffs. A model that refuses benign medical questions harms users as much as one that gives harmful advice.

Evaluation Benchmarks

HELM (Holistic Evaluation of Language Models): Includes toxicity, bias, and disinformation benchmarks. Evaluates real use cases with safety constraints. BBQ (Bias Benchmark for QA): Measures social bias in question answering across demographic groups. ToxiGen: Tests for implicit toxicity and bias against minority groups — harder than explicit toxicity.

Policy Framework

Define three levels: Red lines (universal): CSAM, detailed synthesis instructions for mass-casualty weapons, targeted harassment campaigns. High precision: Requests for harmful instructions; require high confidence before refusal. Gray area: Controversial topics; balance informativeness vs. harm. Track refusal rates by category — over-refusal is also a failure.

✓ Key insight: Over-refusal is a safety failure. If your model refuses benign medical, legal, or educational questions to avoid false positives, you're harming users. Measure refusal rates as a safety metric.

Safety Evaluation Tools

Benchmark

HELM

Holistic evaluation including safety constraints.

Benchmark

BBQ

Bias benchmark for QA systems.

Benchmark

ToxiGen

Implicit toxicity and bias evaluation.

Framework

Safety Taxonomy

Custom red lines and policy definitions.

08 — Further Reading

References

Academic Papers

Paper Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073. — arxiv:2212.08073 ↗
Paper Pal, A. et al. (2023). LlamaGuard: LLM-based Input-Output Safeguard for Open-Sourced Large Language Models. Meta. arXiv:2312.06674. — arxiv:2312.06674 ↗
Paper Zou, A. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG). arXiv:2307.15043. — arxiv:2307.15043 ↗
Paper Chao, P. et al. (2023). PAIR: Prompt Automatic Iterative Refinement. arXiv:2310.03684. — arxiv:2310.03684 ↗
Paper Lee, K. et al. (2022). Deduplicating Training Data Mitigates Privacy Risks in Language Models. arXiv:2202.06539. — arxiv:2202.06539 ↗

Frameworks & Guides

Guide Anthropic. Constitutional AI: Harmlessness from AI Feedback. Blog post. anthropic.com ↗
Docs Meta. LlamaGuard: Safety Classifier for LLM Outputs. github.com/facebookresearch ↗
Docs Rebuff. Prompt Injection Detection & Prevention. docs.rebuff.ai ↗
Guide Guardrails AI. Building Safe LLM Applications. guardrailsai.com ↗

LLM Safety Techniques

The Safety Problem Space

Safety Failure Taxonomy

Constitutional AI and RLAIF

CAI Process

Constitution — principles guide training

Critique — model self-evaluates

Revision — model self-improves

Fine-Tune — supervised learning on revisions

Red-Teaming and Adversarial Evaluation

Red-Teaming Approaches

Manual Red-Teaming — expert judgment

PAIR (Prompt Automatic Iterative Refinement) — LLM-based

GCG (Greedy Coordinate Gradient) — gradient-based

Fuzzing — random perturbations

Runtime Guardrails

Guardrail Methods

Regex & Keyword Filters — fast, brittle

ML Classifiers — learned patterns

LLM-as-Judge — semantic understanding

Prompt Injection Detection — special case

Guardrail Tools

Jailbreak Patterns and Defenses

Major Jailbreak Categories

Privacy and Memorization

Extraction Attacks

Memorization Defenses

Safety Evaluation and Red Lines

Evaluation Benchmarks

Policy Framework

Safety Evaluation Tools

References

Related concepts