An agent pattern where the model reviews and critiques its own output before finalising — acting as both author and editor to catch errors, improve quality, and reduce hallucinations.
Human experts don't submit first drafts. A good lawyer reads their brief again before filing. A careful programmer runs tests before shipping. Reflection gives agents the same discipline: before returning an answer, the model evaluates its own output against explicit criteria and revises if needed.
This is different from simply asking the model to "think carefully" in the prompt. Reflection is a structured loop: generate → critique → revise → (loop or finalise). The critique step uses a separate prompt (or even a separate model) with specific evaluation criteria: factual accuracy, completeness, tone, code correctness, and so on.
Reflection trades latency and cost (extra LLM calls) for quality. It's most valuable for tasks where a wrong answer is expensive: medical information, legal analysis, financial projections, production code.
```
┌─────────────────────────────────────┐
│              User task              │
└──────────────┬──────────────────────┘
               │
       ┌───────▼───────┐
       │   Generate    │  ← Actor: write the initial response
       └───────┬───────┘
               │
       ┌───────▼───────┐
       │   Critique    │  ← Critic: evaluate against criteria
       └───────┬───────┘
               │
      ┌────────▼─────────┐
      │  Needs revision? │
      └──┬────────────┬──┘
    YES  │            │  NO
   ┌─────▼─────┐ ┌────▼────┐
   │  Revise   │ │ Output  │
   └─────┬─────┘ └─────────┘
         │
         └── loop back to Critique
             (until max_rounds or approved)
```
The key design decision is the critique criteria. Generic criteria ("is this good?") produce weak critiques. Specific criteria ("does the code handle edge cases? are all factual claims verifiable? is the tone appropriate for a non-technical audience?") produce actionable feedback.
```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"

def generate(task: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text

def critique(task: str, response: str, criteria: list[str]) -> str:
    criteria_str = "\n".join(f"- {c}" for c in criteria)
    prompt = f'''You are a rigorous critic. Evaluate the following response to
the task against the criteria below. Be specific about what is wrong and
how it should be improved. If the response is good, say "APPROVED".

Task: {task}

Response:
{response}

Evaluation criteria:
{criteria_str}

Critique:'''
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def revise(task: str, response: str, critique_text: str) -> str:
    prompt = f'''Revise the following response based on the critique.

Task: {task}

Original response:
{response}

Critique:
{critique_text}

Improved response:'''
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def reflection_agent(task: str, criteria: list[str], max_rounds: int = 3) -> str:
    response = generate(task)
    for i in range(max_rounds):
        critique_text = critique(task, response, criteria)
        if "APPROVED" in critique_text:
            print(f"Approved after {i + 1} round(s)")
            break
        print(f"Round {i + 1}: revising...")
        response = revise(task, response, critique_text)
    return response

# Example
result = reflection_agent(
    task="Explain how HTTPS works to a 10-year-old.",
    criteria=[
        "Uses no jargon (TLS, certificate, handshake are jargon)",
        "Includes a concrete analogy",
        "Under 150 words",
        "Accurate — no factually wrong statements",
    ],
)
print(result)
```
Reflexion (Shinn et al., 2023) formalises self-critique for agents. Unlike ad-hoc reflection, it adds a memory component: after each failed attempt, the agent writes a short "lesson learned" that gets prepended to future attempts. This prevents the agent from repeating the same mistake.
```python
reflexion_memory = []

def reflexion_agent(task: str, max_attempts: int = 4) -> str:
    response = ""
    for attempt in range(max_attempts):
        # Build context with accumulated lessons
        memory_str = ""
        if reflexion_memory:
            lessons = "\n".join(f"- {m}" for m in reflexion_memory)
            memory_str = f"Previous attempts failed. Lessons learned:\n{lessons}\n\n"
        prompt = memory_str + task
        response = generate(prompt)

        # Evaluate (replace with a domain-specific check)
        critique_text = critique(task, response, criteria=["Is this correct?"])
        if "APPROVED" in critique_text:
            return response

        # Extract a lesson learned from the failed attempt
        lesson_prompt = (
            "In one sentence, what mistake was made and how to avoid it:\n"
            f"{critique_text}"
        )
        lesson = client.messages.create(
            model=MODEL, max_tokens=100,
            messages=[{"role": "user", "content": lesson_prompt}],
        ).content[0].text
        reflexion_memory.append(lesson)
    return response  # return the last attempt
```
Use reflection when: the output quality is high-stakes (medical, legal, financial, production code), when first-pass quality from your model is noticeably below what's needed, or when you have domain-specific quality criteria that can be checked programmatically or via a critic prompt.
Don't use reflection when: latency is critical (each round adds 1-2 LLM calls), the task is simple enough that a well-crafted single prompt suffices, or when the model already performs at ceiling for that task (adding critique loops won't help).
A practical middle ground: run reflection only for outputs that score below a threshold on a quick automated check. For example, generate code → run the tests → if tests fail, trigger the reflection loop. If tests pass, skip it. This targets the reflection cost only where it's needed.
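This gate can be sketched as a thin wrapper. The checker, generator, and reflection loop are passed in as callables so the fast path costs exactly one generation; the function names here are illustrative, not from any library:

```python
from typing import Callable

def gated_reflection(
    task: str,
    check: Callable[[str], bool],
    generate_fn: Callable[[str], str],
    reflect_fn: Callable[[str], str],
) -> str:
    """Run a fast programmatic check on the first draft; only invoke
    the expensive reflection loop when the check fails."""
    draft = generate_fn(task)   # one LLM call
    if check(draft):
        return draft            # fast path: skip reflection entirely
    return reflect_fn(task)     # slow path: full critique/revise loop
```

In the code-generation example, `check` would run the test suite against the draft, and `reflect_fn` would wrap the `reflection_agent` loop shown earlier.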
Instead of the same model critiquing itself, use a separate "critic" model or agent with a different system prompt. This avoids the confirmation bias where the model approves its own flawed output:
```python
def multi_agent_reflection(task: str) -> str:
    # Actor: optimistic, creative
    actor_system = "You are a creative writer focused on engaging, vivid prose."
    # Critic: strict, factual
    critic_system = "You are a fact-checker and editor. Be ruthless about errors."

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=actor_system,
        messages=[{"role": "user", "content": task}],
    ).content[0].text

    critique_prompt = f"Review this response for factual errors and improvements:\n{response}"
    critique_text = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=critic_system,
        messages=[{"role": "user", "content": critique_prompt}],
    ).content[0].text

    # Final synthesis
    final_prompt = (
        f"Original: {response}\n\nCritique: {critique_text}\n\n"
        "Write the improved version:"
    )
    return client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": final_prompt}],
    ).content[0].text
```
Critique loops can converge on mediocrity. After multiple rounds of "make it more concise / add more detail / be more formal / be more friendly", the model can end up with a generic, committee-designed response that's technically correct but dull. Set a maximum of 2-3 rounds and stop even if the critic isn't fully satisfied.
The critic needs criteria, not vibes. "Improve the response" is useless. "The code must handle None inputs without raising an exception" or "the explanation must use no more than two jargon terms" gives the critic something actionable to check.
Cost scales with rounds. Three rounds of reflection cost roughly 4-7× the tokens of a single generation (one generate call, then a critique and a revise per round). Monitor costs carefully and only apply reflection to the outputs that need it.
Programmatic checks beat LLM critics for code. Running unit tests, a linter, or type checker is faster, cheaper, and more reliable than asking an LLM if code is correct. Use LLM critique for qualitative aspects (clarity, tone, completeness) and automated tools for objective correctness.
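A minimal sketch of this split: gate generated code on `compile()` and a caller-supplied test callable, and escalate to an LLM critic only when something fails. The harness below is a simplified assumption, not a production sandbox (real generated code should be executed in isolation):

```python
from typing import Callable

def code_check(source: str, run_tests: Callable[[dict], None]) -> tuple[bool, str]:
    """Objective gate for generated code: syntax, execution, then tests.
    Returns (passed, reason); only failures need an LLM critique."""
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError as e:
        return False, f"syntax error: {e}"
    namespace: dict = {}
    try:
        exec(compiled, namespace)   # define the generated functions
        run_tests(namespace)        # caller's assertions against them
    except AssertionError as e:
        return False, f"test failure: {e}"
    except Exception as e:
        return False, f"runtime error: {e}"
    return True, "all checks passed"
```

The `reason` string doubles as input for the critique prompt, so the critic sees exactly which objective check failed.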
| Pattern | Who Critiques | Iterations | Best For | Cost |
|---|---|---|---|---|
| Self-reflection | Same model, same prompt | 1–3 | Factual checking, format cleanup | Low |
| Self-reflection (chain) | Same model, critique prompt | 2–5 | Reasoning quality, argument strength | Medium |
| Reflexion (with memory) | Same model, episodic memory | 3–10 | Agent task improvement across episodes | Medium-high |
| Multi-agent critique | Separate critic model/role | 1–3 | Bias detection, adversarial review | High |
Reflection works best when the failure mode is "output is plausible but wrong in subtle ways" — factual errors, logical gaps, style inconsistencies. It does not help when the base model lacks the knowledge to generate a correct answer in the first place; in that case, retrieval or fine-tuning is the right fix. Monitor iteration count in production: if a reflection loop consistently runs to its maximum iterations without convergence, the task is likely under-specified or the critique prompt is too harsh. Add a confidence gate — if the model rates the output above a threshold on first pass, skip reflection entirely to save cost and latency.
Gate reflection behind a task-type check: apply it to open-ended generative tasks (writing, analysis, reasoning) but skip it on deterministic tasks like JSON extraction or classification where it adds latency without quality benefit. Use a lightweight self-consistency check instead for factual accuracy verification.
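A lightweight self-consistency check can be sketched as sampling several answers and measuring agreement; here `sample_fn` stands in for repeated LLM calls at a nonzero temperature:

```python
from collections import Counter
from typing import Callable

def self_consistency(task: str, sample_fn: Callable[[str], str],
                     n: int = 5) -> tuple[str, float]:
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample_fn(task) for _ in range(n)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n
```

A low vote share is a cheap signal of disagreement, and a reasonable trigger for escalating that task to a full reflection pass.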