01 — Foundation
The Anatomy of a Production-Grade Prompt
Prompting is the primary interface between you and an LLM. A well-designed prompt can make a smaller model outperform a larger one. Poor prompts make even the best models inconsistent. A production-grade prompt has four core components:
The Four Components
| Component | Purpose | Example |
| --- | --- | --- |
| System | Role, constraints, output format | "You are a senior Python engineer. Always use type hints. Output JSON." |
| Examples | Show the exact format and tone you want | 2–5 input/output pairs demonstrating the task |
| Task | Clear, specific, scoped instruction | "Refactor this function to use a dict instead of a loop" |
| Output Format | Explicit structure of the response | "Return JSON: {code, explanation, test_case}" |
Not every prompt needs all four, but production systems use all of them. Examples are so powerful that 2–3 worked examples often beat 500 words of instruction.
✓ Pro tip: The fastest quality improvement is adding 2–3 worked examples. Models learn format and tone from examples faster than from instructions.
Prompt Structure in Practice
SYSTEM — role, constraints, format:
"You are a senior Python engineer. Always use type hints.
Always add docstrings. Output JSON with keys: code, reason, time_complexity."
EXAMPLES — show the format you want:
Input: "Optimize this loop: for i in range(n): sum += arr[i]"
Output: {"code": "sum(arr)", "reason": "Built-in is faster", "time_complexity": "O(n) → O(n)"}
Input: "Add type hints to this function: def add(a, b): return a + b"
Output: {"code": "def add(a: int, b: int) -> int: return a + b", ...}
TASK — clear, specific, scoped:
"Refactor this function to use a dictionary for O(1) lookups instead of a list."
OUTPUT FORMAT — what you want back:
"Return JSON with keys: refactored_code, original_complexity, new_complexity, explanation"
This structure is discipline. It forces you to be explicit about what you want, which forces the model to be consistent in what it delivers.
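The four components can be assembled programmatically before a call. A minimal sketch (the `build_messages` helper and its field names are illustrative, not part of any SDK):

```python
# Assemble SYSTEM / EXAMPLES / TASK / OUTPUT FORMAT into an API-ready request.
# build_messages is a hypothetical helper, not a library function.

def build_messages(system: str, examples: list[tuple[str, str]],
                   task: str, output_format: str) -> dict:
    """Combine the four prompt components into one request payload."""
    messages = []
    for inp, out in examples:  # few-shot pairs become alternating turns
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": out})
    # Task and output format go together in the final user turn.
    messages.append({"role": "user", "content": f"{task}\n\n{output_format}"})
    return {"system": system, "messages": messages}

request = build_messages(
    system="You are a senior Python engineer. Output JSON.",
    examples=[("Optimize this loop: for i in range(n): s += arr[i]",
               '{"code": "sum(arr)", "reason": "Built-in is faster"}')],
    task="Refactor this function to use a dict for O(1) lookups.",
    output_format="Return JSON with keys: refactored_code, explanation",
)
```

The payload can then be passed to whichever client you use; the point is that each component lives in a named slot rather than one undifferentiated string.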
02 — Techniques
Techniques Ranked by Impact
Not all prompting techniques are equal. Some give you 2–3x quality gains; others give 5–10% gains. Here's the ranking by impact:
1. Few-Shot Examples (Highest Leverage)
Give the model 2–5 worked examples of input→output. Models learn the format, tone, and implicit constraints from examples faster than from text instructions.
- Quality gain: 2–5x on structured tasks
- Why: LLMs are pattern matchers. Examples are patterns.
- When to use: Always, for any production task
2. Chain-of-Thought (For Reasoning)
Ask the model to show its reasoning step-by-step before answering. "Think step by step" or "Reasoning:" forces the model to decompose the problem.
- Quality gain: 10–30% on math, logic, multi-step reasoning
- Why: Latent reasoning emerges when you force verbalization
- When to use: Math, logic, causality, multi-hop problems
3. System Role (For Consistency)
Set the system prompt with a role: "You are a senior data scientist." The role shapes tone, vocabulary, and constraints without explicit instruction.
- Quality gain: 5–15% on consistency and expertise
- Why: Role primes a particular behavioral distribution
- When to use: Any task where tone/expertise matters
4. Output Schema (For Reliability)
Explicitly define the output structure as JSON Schema or XML. "Return JSON with keys: {field1, field2}". Structured outputs are more parseable and predictable.
- Quality gain: 5–10% on parsing reliability
- Why: Reduces ambiguity in output format
- When to use: Any task whose output is consumed downstream
5. Self-Consistency (For Accuracy)
Generate multiple independent completions and take the majority vote. Samples the distribution to find high-likelihood answers.
- Quality gain: 5–15% on hard tasks, costs 3–5x
- Why: Reduces variance of a single sample
- When to use: High-stakes, reasoning-heavy tasks where cost permits
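The voting step of self-consistency is plain Python; only the sampling requires API calls. A minimal sketch (the `ask_model` sampler mentioned in the comment is hypothetical):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer across independent samples."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# In practice each sample is a separate completion at temperature > 0, e.g.
# samples = [ask_model(prompt) for _ in range(5)]  (ask_model is hypothetical).
samples = ["42", "42", "41", "42", "42"]
print(majority_vote(samples))  # → 42
```

Normalizing before counting matters: "42", " 42" and "42." would otherwise split the vote.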
ℹ️ Combination matters: Few-shot + Chain-of-Thought on a math task often outperforms either alone. Test combinations for your specific task.
03 — Debugging
Common Failure Modes and Fixes
Inconsistent Outputs
Problem: Same prompt gives wildly different answers on the same input. Fix: Add 2–3 examples showing the exact format. Examples anchor the output distribution.
Format Errors
Problem: Model returns free text when you asked for JSON. Fix: (1) Add an example in the target format. (2) Use a structured output mode if the API offers one (e.g., JSON mode or tool use). (3) Repeat the format instruction in the task itself: "Return ONLY valid JSON, no markdown."
Hallucination / Fabrication
Problem: Model invents facts or code that don't exist. Fix: (1) Add examples showing confident uncertainty. (2) Add "Say 'I don't know' if unsure." (3) Use retrieval-augmented generation (RAG). (4) Constrain the answer space: "Choose one of: A, B, C."
Incomplete Responses
Problem: Model cuts off mid-response, especially for long outputs. Fix: (1) Increase max_tokens. (2) Break the task into smaller subtasks. (3) Use streaming to detect truncation early.
Off-Topic Tangents
Problem: Model answers something related but not what you asked. Fix: (1) Be more specific in the task statement. (2) Add a "focus" example that shows depth on the right topic. (3) Add a negative example: "Do NOT explain the history of..."
Tone Mismatch
Problem: Output is too formal, too casual, too verbose. Fix: Use the system role and examples. "You are a concise technical writer" + an example of concise output beats "Be concise."
⚠️ Debugging mindset: Treat prompt tuning like a science: change one thing at a time, test on 3–5 examples, measure the change. It's easy to make a prompt "better" on your favorite example while breaking others.
04 — Implementation
Working Code Examples
1. Zero-Shot Baseline
No examples. Just ask. Useful as a baseline to compare against. Usually the weakest approach on structured tasks.
from anthropic import Anthropic

client = Anthropic()

# Zero-shot: no examples, no structure
response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=64,
    messages=[{
        'role': 'user',
        'content': 'Classify sentiment: "The product broke on day 1."'
    }]
)
print(response.content[0].text)
2. Few-Shot Classification
Add 2–3 examples showing the format and boundaries. Watch the quality jump.
FEW_SHOT = """Classify sentiment (positive/negative/neutral):
"Great value" -> positive
"Average experience" -> neutral
"Never buying again" -> negative
"The product broke on day 1." ->"""

response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=8,
    messages=[{'role': 'user', 'content': FEW_SHOT}]
)
print(response.content[0].text)
3. Chain-of-Thought (Reasoning)
For logic or math, ask the model to think step-by-step. This unlocks multi-step reasoning.
response = client.messages.create(
    model='claude-opus-4-5',
    max_tokens=512,
    messages=[{
        'role': 'user',
        'content': (
            'A store sells 3 apples for $1. I buy 12. '
            'Think step by step, then state the total cost.'
        )
    }]
)
print(response.content[0].text)
4. System Prompt Persona
Use the system parameter to set role and constraints. This shapes behavior more subtly than in-message instructions.
response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=256,
    system=(
        'You are a senior Python engineer. '
        'Be concise. Always include code examples. '
        'Use type hints.'
    ),
    messages=[{
        'role': 'user',
        'content': 'How do I reverse a list?'
    }]
)
print(response.content[0].text)
5. Structured JSON Output
Combine few-shot + output schema for maximum reliability. State the schema twice: once in the system, once in the task.
system = """You are a code reviewer. Be critical but fair.
Always respond with valid JSON:
{"verdict": "pass|needs_revision|fail", "issues": [...], "suggestion": "..."}"""

user_task = """Code to review:
def add(a, b):
    return a + b

Respond with JSON. No markdown."""

response = client.messages.create(
    model='claude-opus-4-5',
    max_tokens=256,
    system=system,
    messages=[{'role': 'user', 'content': user_task}]
)
print(response.content[0].text)
05 — Discipline
Prompt Versioning and Testing
Prompts are code. Version them. Test them. Track their performance. A single poorly tuned prompt in production can silently degrade your entire product.
Three-Tier Testing
🧪 Unit Tests (3–5 examples)
- Test edge cases: empty input, long input, special chars
- Test boundaries: happy path + one failure case
- Automate: run on every prompt change
📊 Benchmark (20–50 examples)
- Represents real traffic distribution
- Measure: accuracy, latency, cost
- Track over time as you iterate
🚀 Production Monitor (real traffic)
- Sample outputs, grade by hand
- Alert if quality dips >5%
- Have a rollback plan (previous prompt version)
A/B Test
- New prompt vs. old on 5–10% of traffic
- Measure delta in quality metric
- Deploy only if win is statistical & material
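"Statistical & material" can be checked with a two-proportion z-test, as in this sketch (the 5% lift threshold and 1.96 critical value are illustrative defaults, not recommendations from the text):

```python
from math import sqrt

def ab_win(wins_new: int, n_new: int, wins_old: int, n_old: int,
           min_lift: float = 0.05, z_crit: float = 1.96) -> bool:
    """Deploy only if the new prompt's win rate beats the old one by a
    material margin (min_lift) AND the difference is significant (z-test)."""
    p_new, p_old = wins_new / n_new, wins_old / n_old
    pooled = (wins_new + wins_old) / (n_new + n_old)
    se = sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se if se else 0.0
    return (p_new - p_old) >= min_lift and z >= z_crit
```

A 90% vs 70% win rate over 100 samples each passes both checks; a 72% vs 70% result fails the materiality check even before significance enters.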
Version Control Strategy
Store prompts as versioned files under source control (not as magic strings buried in application code). Example structure:
prompts/
├── sentiment_classifier/
│ ├── v1.0.txt (baseline: zero-shot)
│ ├── v1.1.txt (added 2 examples → +15% accuracy)
│ ├── v1.2.txt (changed wording → +2% accuracy)
│ └── v2.0.txt (current: few-shot + structured output)
├── code_reviewer/
│ ├── v1.0.txt
│ └── v1.1.txt
└── test_cases.jsonl (3-5 examples per task)
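A benchmark runner over a `test_cases.jsonl` file can be a few lines. A sketch, assuming one JSON object per line with `input` and `expected` keys (the file layout and the `predict` callable are assumptions; swap in a real model call):

```python
import json
from typing import Callable

def score_prompt(predict: Callable[[str], str], test_cases_path: str) -> float:
    """Run a prediction function over JSONL test cases; return exact-match accuracy."""
    correct = total = 0
    with open(test_cases_path) as f:
        for line in f:
            case = json.loads(line)  # assumed shape: {"input": ..., "expected": ...}
            total += 1
            if predict(case["input"]).strip() == case["expected"]:
                correct += 1
    return correct / total if total else 0.0
```

Run it once per prompt version and record the score next to the version file; the accuracy deltas in the tree above come from exactly this kind of comparison.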
A/B Testing Workflow
1. Write candidate prompt. Based on failure analysis, add examples or restructure.
2. Test on benchmark. Run on 20–50 test cases. Score. Compare to baseline.
3. If improvement >5%: Deploy to 5–10% of traffic for 1 week.
4. Monitor metrics. Manual spot-checks + automated scoring. Alert if quality dips.
5. If stable, roll out to 100%. Keep old prompt version as rollback. Commit to version control with date and A/B results.
✓ Best practice: Keep 2–3 versions of each prompt in production. If the current version breaks, roll back immediately instead of debugging live.
06 — Further Reading
References
Academic Papers
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Liu, P. et al. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv:2107.13586.
Further Exploration
Prompt Chaining and Pipelines
Single prompts rarely solve complex tasks. Chaining breaks a hard problem into sequential sub-tasks, each with a focused prompt. The output of one call becomes the input to the next. Chains are deterministic (no branching) and easy to debug, making them the right default before reaching for agents.
Common chain patterns: Decompose → Solve → Aggregate (split a big question, answer each part, merge), Draft → Critique → Revise (generate then self-correct), and Extract → Transform → Format (parse raw input, process, structure output). DSPy and LangChain Expression Language both formalise chains, but plain Python function calls work just as well for short chains.
import openai

client = openai.OpenAI()

def call(prompt, system="You are a helpful assistant.", model="gpt-4o-mini"):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": prompt}],
    ).choices[0].message.content

# Chain: Decompose → Solve → Synthesise
def answer_complex(question: str) -> str:
    # Step 1: decompose into sub-questions
    sub_qs = call(
        f"Break this question into 3 focused sub-questions:\n{question}",
        system="Output only a numbered list.",
    )
    # Step 2: answer each sub-question
    answers = call(
        f"Answer each sub-question concisely:\n{sub_qs}",
        system="Be factual and brief.",
    )
    # Step 3: synthesise final answer
    return call(
        f"Original question: {question}\nSub-answers:\n{answers}\n"
        "Write a concise final answer.",
        system="Synthesise the sub-answers into a single, coherent response.",
    )

print(answer_complex("How should I choose between RAG and fine-tuning for a customer support bot?"))
| Pattern | Use when | Trade-off |
| --- | --- | --- |
| Sequential chain | Steps are dependent and linear | Latency adds up; errors propagate |
| Parallel fan-out | Sub-tasks are independent | Lower latency; needs merge step |
| Draft → critique | Quality matters more than speed | 2× LLM calls; usually worth it |
| Agent loop | Path is unknown at design time | Unpredictable steps; harder to debug |
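The parallel fan-out pattern can be sketched with a thread pool; here `call` is a stand-in for a real LLM client (the function body is a placeholder, not an API):

```python
from concurrent.futures import ThreadPoolExecutor

def call(prompt: str) -> str:
    # Stand-in for an LLM API call; replace with a real client invocation.
    return f"answer to: {prompt}"

def fan_out(sub_tasks: list[str]) -> list[str]:
    """Run independent sub-tasks concurrently; return results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(call, sub_tasks))  # map preserves input order

results = fan_out(["summarise doc A", "summarise doc B", "summarise doc C"])
merged = "\n".join(results)  # the merge step from the table above
```

Threads suffice because LLM calls are I/O-bound; the merge step is where the fan-out's latency saving is paid for with one extra aggregation prompt or a simple join.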