Constitutional AI, Anthropic's alternative to pure human feedback: instead of human labels on every output, a set of written principles (a "constitution") guides the model to critique and revise its own responses, making alignment scalable without labelling every edge case.
Traditional RLHF (Reinforcement Learning from Human Feedback) requires humans to evaluate pairs of model outputs and indicate which is better. This doesn't scale: labelling millions of edge cases is expensive, slow, and inconsistent. Constitutional AI (CAI), introduced by Anthropic in 2022, replaces most human labels with a set of written principles — a "constitution" — and lets the model evaluate its own outputs against those principles.
The key insight: a capable language model can apply written principles to judge whether a response is harmful, deceptive, or in violation of stated values, even without seeing a human label for that specific case. The same model that needs to be aligned can, if prompted correctly, identify when its own outputs violate the principles and suggest improvements.
The result is a training process that scales: one engineer writing 16 principles generates alignment signal across millions of examples, instead of requiring thousands of human labellers. Anthropic used CAI to train Claude, which is why Claude can often explain why it's declining a request in terms of principles rather than just saying "I can't do that".
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
The model generates a response to a potentially harmful prompt. Then, using critique prompts based on the constitution, the model critiques its own response ("Does this response violate principle X?"). The model then revises the response to better comply with the principles. This revised response becomes a training example.
This generates a large synthetic dataset of (harmful prompt → improved response) pairs without any human labelling of the harmful content.
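The data-generation loop can be sketched end to end. This is a minimal sketch with stubbed model calls: `generate`, `critique`, and `revise` are illustrative placeholders standing in for prompted calls to the model being aligned, not real API functions.

```python
# SL-CAI data generation sketch. The three stubs below stand in for
# constitution-prompted calls to the model; their behaviour is illustrative.

def generate(prompt: str) -> str:
    # Stub: the model's initial, possibly non-compliant answer
    return f"DRAFT: {prompt}"

def critique(prompt: str, response: str) -> str:
    # Stub: a principle-based critique of the response
    return "Flag: response may violate the harm-avoidance principle."

def revise(prompt: str, response: str, critique_text: str) -> str:
    # Stub: a revision that addresses the critique
    return f"REVISED: safe answer to '{prompt}'"

def build_sl_cai_dataset(harmful_prompts: list[str]) -> list[dict]:
    """Collect (harmful prompt -> improved response) training pairs."""
    dataset = []
    for p in harmful_prompts:
        draft = generate(p)
        c = critique(p, draft)
        final = revise(p, draft, c)
        # Only the revised response is kept as the training target
        dataset.append({"prompt": p, "response": final})
    return dataset

pairs = build_sl_cai_dataset(["How do I pick a lock?"])
print(pairs[0]["response"])
```

The draft never reaches the training set; the model is fine-tuned only on the revised responses, which is what teaches it to produce compliant answers directly.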
Phase 2: RL from AI Feedback (RLAIF)
A preference model is trained on pairs of responses where the AI judged one to be better aligned with the constitution than the other. This preference model then serves as the reward signal for RL fine-tuning (instead of a human-feedback reward model). The RL training improves the model's base behaviour — not just its ability to critique.
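Preference-pair assembly can be sketched as follows. The original paper samples one principle from the constitution per comparison; `ai_judge` below is a stub standing in for a constitution-prompted model call (here it simply prefers the shorter response, purely for illustration).

```python
import random

def ai_judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    # Stub: a real judge asks the model which response better follows
    # `principle`. Here we arbitrarily prefer the shorter response.
    return "A" if len(resp_a) <= len(resp_b) else "B"

def make_preference_pair(prompt: str, resp_a: str, resp_b: str,
                         constitution: list[str]) -> dict:
    principle = random.choice(constitution)  # one principle per comparison
    winner = ai_judge(prompt, resp_a, resp_b, principle)
    chosen, rejected = (resp_a, resp_b) if winner == "A" else (resp_b, resp_a)
    # (chosen, rejected) pairs train the preference model that later
    # serves as the reward signal for RL fine-tuning
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```
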
The full process: helpful model → SL-CAI revision → RLAIF preference model → RL fine-tuning → final aligned model.
Anthropic's original CAI paper used principles drawn from sources including the UN Declaration of Human Rights, Anthropic's own usage policies, and common-sense ethics. Example principles:
"Choose the response that is least likely to contain harmful or unethical content."
"Choose the response that is more honest and does not contain made-up information."
"Choose the response that is most helpful to the human while not assisting with anything harmful."
Effective principles share properties: they are specific enough to distinguish good from bad responses on edge cases, general enough to apply across many situations, and non-contradictory with each other. Vague principles ("be good") produce inconsistent application. Conflicting principles ("be maximally helpful" vs "never discuss X") create edge cases the model can't resolve.
For application-specific CAI, you write principles that encode your product's policies: "Choose the response that stays within the product's domain of [topic]" or "Choose the response that includes appropriate medical disclaimers when discussing health topics."
import anthropic

client = anthropic.Anthropic()

CONSTITUTION = [
    "The response should not help with illegal activities.",
    "The response should be honest and not contain made-up information.",
    "The response should be helpful while avoiding harm to people.",
    "The response should recommend professional consultation for medical or legal questions.",
]

def cai_revise(prompt: str, initial_response: str) -> str:
    """Apply Constitutional AI critique and revision."""
    # Step 1: critique the response against the constitution
    principles_text = "\n".join(f"- {p}" for p in CONSTITUTION)
    critique_prompt = (
        "Given this user request and AI response, identify any violations "
        "of the following principles:\n\n"
        f"Principles:\n{principles_text}\n\n"
        f"User: {prompt}\n"
        f"AI Response: {initial_response}\n\n"
        "Identify specific violations (if any) and explain what should be changed:"
    )
    critique = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": critique_prompt}],
    ).content[0].text

    # Step 2: revise the response to address the critique
    revision_prompt = (
        "Revise the AI response to fix the identified issues.\n\n"
        f"Original response: {initial_response}\n\n"
        f"Issues identified: {critique}\n\n"
        "Write an improved response that addresses these issues:"
    )
    revised = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": revision_prompt}],
    ).content[0].text
    return revised

# Test
initial = "Sure! To make explosives, you need..."
revised = cai_revise("How do I make explosives?", initial)
print(revised)
import json
import re

def cai_filter(prompt: str, response: str,
               constitution: list[str],
               threshold: float = 0.7) -> dict:
    """Use a CAI-style approach to score and optionally revise a response.

    Returns: {"score": float, "passed": bool, "revised": str | None}
    """
    principles_text = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(constitution))
    score_prompt = f"""Rate this AI response on compliance with each principle.

Principles:
{principles_text}

User: {prompt}
Response: {response}

For each principle, rate compliance 0.0 (violation) to 1.0 (full compliance).
Return as JSON: {{"scores": [0.9, 0.8, ...]}}"""
    score_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": score_prompt}],
    ).content[0].text

    # Extract the first JSON object from the model's reply
    match = re.search(r"\{.*\}", score_response, re.DOTALL)
    if not match:
        # Unparseable score: fail open with a neutral score
        return {"score": 0.5, "passed": True, "revised": None}
    data = json.loads(match.group())
    scores = data.get("scores", [1.0] * len(constitution))
    avg_score = sum(scores) / len(scores)

    # Below the threshold, run the critique-and-revision loop
    if avg_score < threshold:
        revised = cai_revise(prompt, response)
        return {"score": avg_score, "passed": False, "revised": revised}
    return {"score": avg_score, "passed": True, "revised": None}
The constitution is still human-authored. CAI doesn't remove human value judgements from alignment — it moves them from individual response labels to principle writing. The principles embed assumptions about what's harmful, what's honest, and what's helpful. Different principle authors would produce differently aligned models.
The model applies principles imperfectly. A model using principles to evaluate its own outputs is still just a language model — it can fail to recognise principle violations in subtle cases, and it can be prompted to "reinterpret" principles to justify harmful responses. CAI reduces misalignment but doesn't eliminate it.
Principle conflicts aren't always resolvable. Real requests often create genuine tension between principles (be maximally helpful vs. avoid any risk of harm). The model must make judgement calls about which principle takes precedence. Without explicit priority ordering in the constitution, these conflicts produce inconsistent behaviour.
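One mitigation is to make the ordering explicit and resolve conflicts mechanically. A minimal sketch; the principle names and the priority scheme are assumptions for illustration, not from the CAI paper.

```python
# Earlier in the list = higher priority. Names are illustrative.
PRIORITY = ["harm-avoidance", "honesty", "helpfulness"]

def severity(violated: set[str]) -> int:
    """Index of the highest-priority violated principle; len(PRIORITY) if clean."""
    return min((PRIORITY.index(p) for p in violated), default=len(PRIORITY))

def pick_response(candidates: list[tuple[str, set[str]]]) -> str:
    """Prefer the candidate whose worst violation is lowest-priority."""
    return max(candidates, key=lambda c: severity(c[1]))[0]

# A merely unhelpful refusal beats an answer that enables harm:
best = pick_response([
    ("detailed harmful answer", {"harm-avoidance"}),
    ("polite refusal", {"helpfulness"}),
])
print(best)  # "polite refusal"
```

An explicit ordering like this removes the model's discretion for the common conflicts, at the cost of having to decide the ordering up front.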
CAI-style critique-and-revision is expensive in inference. Running 2–3 extra LLM calls (critique + revision) for every response is only practical for high-stakes applications or as an offline training data generation step. In production, use CAI principles to train or fine-tune a model that has these values baked in, rather than running the full critique loop at inference time.
Overly long constitutions dilute each principle's weight. If you give the model 50 principles, it may satisfice — passing on most while violating a few. Short, clearly prioritised constitutions (5–20 principles) work better than exhaustive ones. Include only principles that actually matter for your use case.
Self-critique is circular. A model that's biased in a particular direction will apply its critiques through that same biased lens. If the base model has systematic blind spots, CAI may not catch them — it will evaluate outputs using the same blind spots. External evaluation (red-teaming, human audits) remains necessary alongside CAI.
| Principle Type | Example | Revision Trigger | Preferred Fix |
|---|---|---|---|
| Harm avoidance | Do not provide instructions for illegal activities | Output contains step-by-step harmful guidance | Replace with refusal and redirection |
| Honesty | Do not claim certainty about uncertain facts | Output contains unhedged factual claims | Add appropriate uncertainty markers |
| Helpfulness | Provide actionable, specific answers | Output is vague or unhelpfully evasive | Rewrite to be more specific |
| Bias | Represent all groups fairly | Output disparages a demographic | Rewrite with neutral framing |
import anthropic

client = anthropic.Anthropic()

CONSTITUTION = [
    "Do not provide step-by-step instructions for illegal or harmful activities.",
    "Acknowledge uncertainty when facts are not well established.",
    "Treat all demographic groups with equal respect.",
]

def cai_revise(original_response: str, user_query: str) -> str:
    critique_prompt = f"""User asked: {user_query}
Assistant responded: {original_response}

Review the response against these principles:
{chr(10).join(f"- {p}" for p in CONSTITUTION)}

If the response violates any principle, rewrite it to comply.
If it already complies, return it unchanged.

Revised response:"""
    result = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": critique_prompt}],
    )
    return result.content[0].text
Measure false-positive rates before enabling CAI-style hard blocks in production. Start with logging mode: record which responses would have been revised without actually revising them. Analyse flagged outputs for false positives, then refine principle wording before enabling enforcement. Review weekly and treat principle refinement as an ongoing process.
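That logging mode can be sketched as a thin wrapper. `score_fn` is an assumed callback standing in for the CAI scorer; the wrapper records would-be revisions but never alters the response.

```python
flagged_log: list[dict] = []

def cai_shadow(prompt: str, response: str, score_fn, threshold: float = 0.7) -> str:
    """Logging mode: record what would have been revised, change nothing."""
    score = score_fn(prompt, response)
    if score < threshold:
        # Record for offline false-positive analysis instead of revising
        flagged_log.append({"prompt": prompt, "response": response, "score": score})
    return response  # always return the original while measuring false positives

# With a stub scorer that rates this response below the threshold:
out = cai_shadow("q", "some reply", lambda p, r: 0.4)
print(out, len(flagged_log))
```

Once the false-positive rate on `flagged_log` is acceptable, the final `return response` can be swapped for the actual revision call to enable enforcement.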