An agent pattern where the model reviews and critiques its own output before finalising — acting as both author and editor to catch errors, improve quality, and reduce hallucinations.
Human experts don't submit first drafts. A good lawyer reads their brief again before filing. A careful programmer runs tests before shipping. Reflection gives agents the same discipline: before returning an answer, the model evaluates its own output against explicit criteria and revises if needed.
This is different from simply asking the model to "think carefully" in the prompt. Reflection is a structured loop: generate → critique → revise → (loop or finalise). The critique step uses a separate prompt (or even a separate model) with specific evaluation criteria: factual accuracy, completeness, tone, code correctness, and so on.
Reflection trades latency and cost (extra LLM calls) for quality. It's most valuable for tasks where a wrong answer is expensive: medical information, legal analysis, financial projections, production code.
```
┌─────────────────────────────────────┐
│              User task              │
└──────────────┬──────────────────────┘
               │
       ┌───────▼───────┐
       │   Generate    │  ← Actor: write the initial response
       └───────┬───────┘
               │
       ┌───────▼───────┐
       │   Critique    │  ← Critic: evaluate against criteria
       └───────┬───────┘
               │
      ┌────────▼─────────┐
      │  Needs revision? │
      └──┬────────────┬──┘
    YES  │            │  NO
   ┌─────▼─────┐ ┌────▼────┐
   │  Revise   │ │ Output  │
   └─────┬─────┘ └─────────┘
         │
         └── loop back to Critique
             (until max_rounds or approved)
```
The key design decision is the critique criteria. Generic criteria ("is this good?") produce weak critiques. Specific criteria ("does the code handle edge cases? are all factual claims verifiable? is the tone appropriate for a non-technical audience?") produce actionable feedback.
```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"

def generate(task: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text

def critique(task: str, response: str, criteria: list[str]) -> str:
    criteria_str = "\n".join(f"- {c}" for c in criteria)
    prompt = f'''You are a rigorous critic. Evaluate the following response to
the task against the criteria below. Be specific about what is wrong and
how it should be improved. If the response is good, say "APPROVED".

Task: {task}

Response:
{response}

Evaluation criteria:
{criteria_str}

Critique:'''
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def revise(task: str, response: str, critique_text: str) -> str:
    prompt = f'''Revise the following response based on the critique.

Task: {task}

Original response:
{response}

Critique:
{critique_text}

Improved response:'''
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def reflection_agent(task: str, criteria: list[str], max_rounds: int = 3) -> str:
    response = generate(task)
    for i in range(max_rounds):
        critique_text = critique(task, response, criteria)
        if "APPROVED" in critique_text:
            print(f"Approved after {i + 1} round(s)")
            break
        print(f"Round {i + 1}: revising...")
        response = revise(task, response, critique_text)
    return response

# Example
result = reflection_agent(
    task="Explain how HTTPS works to a 10-year-old.",
    criteria=[
        "Uses no jargon (TLS, certificate, handshake are jargon)",
        "Includes a concrete analogy",
        "Under 150 words",
        "Accurate — no factually wrong statements",
    ],
)
print(result)
```
Reflexion (Shinn et al., 2023) formalises self-critique for agents. Unlike ad-hoc reflection, it adds a memory component: after each failed attempt, the agent writes a short "lesson learned" that gets prepended to future attempts. This prevents the agent from repeating the same mistake.
```python
reflexion_memory = []

def reflexion_agent(task: str, max_attempts: int = 4) -> str:
    response = ""
    for attempt in range(max_attempts):
        # Build context with accumulated lessons
        memory_str = ""
        if reflexion_memory:
            lessons = "\n".join(f"- {m}" for m in reflexion_memory)
            memory_str = f"Previous attempts failed. Lessons learned:\n{lessons}\n\n"
        prompt = memory_str + task
        response = generate(prompt)

        # Evaluate (replace with a domain-specific check)
        critique_text = critique(task, response, criteria=["Is this correct?"])
        if "APPROVED" in critique_text:
            return response

        # Extract a lesson learned from the failed attempt
        lesson_prompt = (
            "In one sentence, what mistake was made and how to avoid it:\n"
            f"{critique_text}"
        )
        lesson = client.messages.create(
            model=MODEL, max_tokens=100,
            messages=[{"role": "user", "content": lesson_prompt}],
        ).content[0].text
        reflexion_memory.append(lesson)
    return response  # return the last attempt
```
Use reflection when: the output quality is high-stakes (medical, legal, financial, production code), when first-pass quality from your model is noticeably below what's needed, or when you have domain-specific quality criteria that can be checked programmatically or via a critic prompt.
Don't use reflection when: latency is critical (each round adds 1-2 LLM calls), the task is simple enough that a well-crafted single prompt suffices, or when the model already performs at ceiling for that task (adding critique loops won't help).
A practical middle ground: run reflection only for outputs that score below a threshold on a quick automated check. For example, generate code → run the tests → if tests fail, trigger the reflection loop. If tests pass, skip it. This targets the reflection cost only where it's needed.
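This gate can be sketched as a thin wrapper. The checker, generator, and reflection loop are passed in as callables so the fast path costs exactly one generation; the function names here are illustrative, not from any library:

```python
from typing import Callable

def gated_reflection(
    task: str,
    check: Callable[[str], bool],
    generate_fn: Callable[[str], str],
    reflect_fn: Callable[[str], str],
) -> str:
    """Run a fast programmatic check on the first draft; only invoke
    the expensive reflection loop when the check fails."""
    draft = generate_fn(task)   # one LLM call
    if check(draft):
        return draft            # fast path: skip reflection entirely
    return reflect_fn(task)     # slow path: full critique/revise loop
```

In the code-generation example, `check` would run the test suite against the draft, and `reflect_fn` would wrap the `reflection_agent` loop shown earlier.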
Instead of the same model critiquing itself, use a separate "critic" model or agent with a different system prompt. This avoids the confirmation bias where the model approves its own flawed output:
```python
def multi_agent_reflection(task: str) -> str:
    # Actor: optimistic, creative
    actor_system = "You are a creative writer focused on engaging, vivid prose."
    # Critic: strict, factual
    critic_system = "You are a fact-checker and editor. Be ruthless about errors."

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=actor_system,
        messages=[{"role": "user", "content": task}],
    ).content[0].text

    critique_prompt = f"Review this response for factual errors and improvements:\n{response}"
    critique_text = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=critic_system,
        messages=[{"role": "user", "content": critique_prompt}],
    ).content[0].text

    # Final synthesis
    final_prompt = (
        f"Original: {response}\n\nCritique: {critique_text}\n\n"
        "Write the improved version:"
    )
    return client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": final_prompt}],
    ).content[0].text
```
Critique loops can converge on mediocrity. After multiple rounds of "make it more concise / add more detail / be more formal / be more friendly", the model can end up with a generic, committee-designed response that's technically correct but dull. Set a maximum of 2-3 rounds and stop even if the critic isn't fully satisfied.
The critic needs criteria, not vibes. "Improve the response" is useless. "The code must handle None inputs without raising an exception" or "the explanation must use no more than two jargon terms" gives the critic something actionable to check.
Cost scales with rounds. Three rounds of reflection cost roughly 4-7× the tokens of a single generation (one generate call, then a critique and a revise per round). Monitor costs carefully and only apply reflection to the outputs that need it.
Programmatic checks beat LLM critics for code. Running unit tests, a linter, or type checker is faster, cheaper, and more reliable than asking an LLM if code is correct. Use LLM critique for qualitative aspects (clarity, tone, completeness) and automated tools for objective correctness.
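A minimal sketch of this split: gate generated code on `compile()` and a caller-supplied test callable, and escalate to an LLM critic only when something fails. The harness below is a simplified assumption, not a production sandbox (real generated code should be executed in isolation):

```python
from typing import Callable

def code_check(source: str, run_tests: Callable[[dict], None]) -> tuple[bool, str]:
    """Objective gate for generated code: syntax, execution, then tests.
    Returns (passed, reason); only failures need an LLM critique."""
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError as e:
        return False, f"syntax error: {e}"
    namespace: dict = {}
    try:
        exec(compiled, namespace)   # define the generated functions
        run_tests(namespace)        # caller's assertions against them
    except AssertionError as e:
        return False, f"test failure: {e}"
    except Exception as e:
        return False, f"runtime error: {e}"
    return True, "all checks passed"
```

The `reason` string doubles as input for the critique prompt, so the critic sees exactly which objective check failed.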
| Pattern | Who Critiques | Iterations | Best For | Cost |
|---|---|---|---|---|
| Self-reflection | Same model, same prompt | 1–3 | Factual checking, format cleanup | Low |
| Self-reflection (chain) | Same model, critique prompt | 2–5 | Reasoning quality, argument strength | Medium |
| Reflexion (with memory) | Same model, episodic memory | 3–10 | Agent task improvement across episodes | Medium-high |
| Multi-agent critique | Separate critic model/role | 1–3 | Bias detection, adversarial review | High |
Reflection works best when the failure mode is "output is plausible but wrong in subtle ways" — factual errors, logical gaps, style inconsistencies. It does not help when the base model lacks the knowledge to generate a correct answer in the first place; in that case, retrieval or fine-tuning is the right fix. Monitor iteration count in production: if a reflection loop consistently runs to its maximum iterations without convergence, the task is likely under-specified or the critique prompt is too harsh. Add a confidence gate — if the model rates the output above a threshold on first pass, skip reflection entirely to save cost and latency.
Gate reflection behind a task-type check: apply it to open-ended generative tasks (writing, analysis, reasoning) but skip it on deterministic tasks like JSON extraction or classification where it adds latency without quality benefit. Use a lightweight self-consistency check instead for factual accuracy verification.
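A lightweight self-consistency check can be sketched as sampling several answers and measuring agreement; here `sample_fn` stands in for repeated LLM calls at a nonzero temperature:

```python
from collections import Counter
from typing import Callable

def self_consistency(task: str, sample_fn: Callable[[str], str],
                     n: int = 5) -> tuple[str, float]:
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample_fn(task) for _ in range(n)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n
```

A low vote share is a cheap signal of disagreement, and a reasonable trigger for escalating that task to a full reflection pass.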