01 — Foundation
The Anatomy of a Production-Grade Prompt
Prompting is the primary interface between you and an LLM. A well-designed prompt can make a smaller model outperform a larger one. Poor prompts make even the best models inconsistent. A production-grade prompt has four core components:
The Four Components
| Component | Purpose | Example |
| --- | --- | --- |
| System | Role, constraints, output format | "You are a senior Python engineer. Always use type hints. Output JSON." |
| Examples | Show the exact format and tone you want | 2–5 input/output pairs demonstrating the task |
| Task | Clear, specific, scoped instruction | "Refactor this function to use a dict instead of a loop" |
| Output Format | Explicit structure of the response | "Return JSON: {code, explanation, test_case}" |
Not every prompt needs all four, but production systems use all of them. Examples are so powerful that 2–3 worked examples often beat 500 words of instruction.
✓ Pro tip: The fastest quality improvement is adding 2–3 worked examples. Models learn format and tone from examples faster than from instructions.
Prompt Structure in Practice
SYSTEM — role, constraints, format:
"You are a senior Python engineer. Always use type hints.
Always add docstrings. Output JSON with keys: code, reason, time_complexity."
EXAMPLES — show the format you want:
Input: "Optimize this loop: for i in range(n): sum += arr[i]"
Output: {"code": "sum(arr)", "reason": "Built-in is faster", "time_complexity": "O(n) → O(n)"}
Input: "Add type hints to this function: def add(a, b): return a + b"
Output: {"code": "def add(a: int, b: int) -> int: return a + b", ...}
TASK — clear, specific, scoped:
"Refactor this function to use a dictionary for O(1) lookups instead of a list."
OUTPUT FORMAT — what you want back:
"Return JSON with keys: refactored_code, original_complexity, new_complexity, explanation"
This structure is discipline. It forces you to be explicit about what you want, which forces the model to be consistent in what it delivers.
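The four components can be assembled programmatically before a call. A minimal sketch (the `build_messages` helper and its field names are illustrative, not part of any SDK):

```python
# Assemble SYSTEM / EXAMPLES / TASK / OUTPUT FORMAT into an API-ready request.
# build_messages is a hypothetical helper, not a library function.

def build_messages(system: str, examples: list[tuple[str, str]],
                   task: str, output_format: str) -> dict:
    """Combine the four prompt components into one request payload."""
    messages = []
    for inp, out in examples:  # few-shot pairs become alternating turns
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": out})
    # Task and output format go together in the final user turn.
    messages.append({"role": "user", "content": f"{task}\n\n{output_format}"})
    return {"system": system, "messages": messages}

request = build_messages(
    system="You are a senior Python engineer. Output JSON.",
    examples=[("Optimize this loop: for i in range(n): s += arr[i]",
               '{"code": "sum(arr)", "reason": "Built-in is faster"}')],
    task="Refactor this function to use a dict for O(1) lookups.",
    output_format="Return JSON with keys: refactored_code, explanation",
)
```

The payload can then be passed to whichever client you use; the point is that each component lives in a named slot rather than one undifferentiated string.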
02 — Techniques
Techniques Ranked by Impact
Not all prompting techniques are equal. Some give you 2–3x quality gains; others give 5–10% gains. Here's the ranking by impact:
1. Few-Shot Examples (Highest Leverage)
Give the model 2–5 worked examples of input→output. Models learn the format, tone, and implicit constraints from examples faster than from text instructions.
- Quality gain: 2–5x on structured tasks
- Why: LLMs are pattern matchers. Examples are patterns.
- When to use: Always, for any production task
2. Chain-of-Thought (For Reasoning)
Ask the model to show its reasoning step-by-step before answering. "Think step by step" or "Reasoning:" forces the model to decompose the problem.
- Quality gain: 10–30% on math, logic, multi-step reasoning
- Why: Latent reasoning emerges when you force verbalization
- When to use: Math, logic, causality, multi-hop problems
3. System Role (For Consistency)
Set the system prompt with a role: "You are a senior data scientist." The role shapes tone, vocabulary, and constraints without explicit instruction.
- Quality gain: 5–15% on consistency and expertise
- Why: Role primes a particular behavioral distribution
- When to use: Any task where tone/expertise matters
4. Output Schema (For Reliability)
Explicitly define the output structure as JSON Schema or XML. "Return JSON with keys: {field1, field2}". Structured outputs are more parseable and predictable.
- Quality gain: 5–10% on parsing reliability
- Why: Reduces ambiguity in output format
- When to use: Any task whose output is consumed downstream
5. Self-Consistency (For Accuracy)
Generate multiple independent completions and take the majority vote. Samples the distribution to find high-likelihood answers.
- Quality gain: 5–15% on hard tasks, costs 3–5x
- Why: Reduces variance of a single sample
- When to use: High-stakes, reasoning-heavy tasks where cost permits
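The voting step of self-consistency is plain Python; only the sampling requires API calls. A minimal sketch (the `ask_model` sampler mentioned in the comment is hypothetical):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer across independent samples."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# In practice each sample is a separate completion at temperature > 0, e.g.
# samples = [ask_model(prompt) for _ in range(5)]  (ask_model is hypothetical).
samples = ["42", "42", "41", "42", "42"]
print(majority_vote(samples))  # → 42
```

Normalizing before counting matters: "42", " 42" and "42." would otherwise split the vote.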
ℹ️ Combination matters: Few-shot + Chain-of-Thought on a math task often outperforms either alone. Test combinations for your specific task.
03 — Debugging
Common Failure Modes and Fixes
Inconsistent Outputs
Problem: Same prompt gives wildly different answers on the same input. Fix: Add 2–3 examples showing the exact format. Examples anchor the output distribution.
Format Errors
Problem: Model returns free text when you asked for JSON. Fix: (1) Add an example in the target format. (2) Use a structured output mode if the API offers one (e.g., JSON mode or tool use). (3) Repeat the format instruction in the task itself: "Return ONLY valid JSON, no markdown."
Hallucination / Fabrication
Problem: Model invents facts or code that don't exist. Fix: (1) Add examples showing confident uncertainty. (2) Add "Say 'I don't know' if unsure." (3) Use retrieval-augmented generation (RAG). (4) Constrain the answer space: "Choose one of: A, B, C."
Incomplete Responses
Problem: Model cuts off mid-response, especially for long outputs. Fix: (1) Increase max_tokens. (2) Break the task into smaller subtasks. (3) Use streaming to detect truncation early.
Off-Topic Tangents
Problem: Model answers something related but not what you asked. Fix: (1) Be more specific in the task statement. (2) Add a "focus" example that shows depth on the right topic. (3) Add a negative example: "Do NOT explain the history of..."
Tone Mismatch
Problem: Output is too formal, too casual, too verbose. Fix: Use the system role and examples. "You are a concise technical writer" + an example of concise output beats "Be concise."
⚠️ Debugging mindset: Treat prompt tuning like a science: change one thing at a time, test on 3–5 examples, measure the change. It's easy to make a prompt "better" on your favorite example while breaking others.
04 — Implementation
Working Code Examples
1. Zero-Shot Baseline
No examples. Just ask. Useful as a baseline to compare against. Usually the weakest approach on structured tasks.
from anthropic import Anthropic

client = Anthropic()

# Zero-shot: no examples, no structure
response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=64,
    messages=[{
        'role': 'user',
        'content': 'Classify sentiment: "The product broke on day 1."'
    }]
)
print(response.content[0].text)
2. Few-Shot Classification
Add 2–3 examples showing the format and boundaries. Watch the quality jump.
FEW_SHOT = """Classify sentiment (positive/negative/neutral):
"Great value" -> positive
"Average experience" -> neutral
"Never buying again" -> negative
"The product broke on day 1." ->"""

response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=8,
    messages=[{'role': 'user', 'content': FEW_SHOT}]
)
print(response.content[0].text)
3. Chain-of-Thought (Reasoning)
For logic or math, ask the model to think step-by-step. This unlocks multi-step reasoning.
response = client.messages.create(
    model='claude-opus-4-5',
    max_tokens=512,
    messages=[{
        'role': 'user',
        'content': (
            'A store sells 3 apples for $1. I buy 12. '
            'Think step by step, then state the total cost.'
        )
    }]
)
print(response.content[0].text)
4. System Prompt Persona
Use the system parameter to set role and constraints. This shapes behavior more subtly than in-message instructions.
response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=256,
    system=(
        'You are a senior Python engineer. '
        'Be concise. Always include code examples. '
        'Use type hints.'
    ),
    messages=[{
        'role': 'user',
        'content': 'How do I reverse a list?'
    }]
)
print(response.content[0].text)
5. Structured JSON Output
Combine few-shot + output schema for maximum reliability. State the schema twice: once in the system, once in the task.
system = """You are a code reviewer. Be critical but fair.
Always respond with valid JSON:
{"verdict": "pass|needs_revision|fail", "issues": [...], "suggestion": "..."}"""

user_task = """Code to review:
def add(a, b):
    return a + b

Respond with JSON. No markdown."""

response = client.messages.create(
    model='claude-opus-4-5',
    max_tokens=256,
    system=system,
    messages=[{'role': 'user', 'content': user_task}]
)
print(response.content[0].text)
05 — Discipline
Prompt Versioning and Testing
Prompts are code. Version them. Test them. Track their performance. A single poorly tuned prompt in production can silently degrade your entire product.
Three-Tier Testing
🧪 Unit Tests (3–5 examples)
- Test edge cases: empty input, long input, special chars
- Test boundaries: happy path + one failure case
- Automate: run on every prompt change
📊 Benchmark (20–50 examples)
- Represents real traffic distribution
- Measure: accuracy, latency, cost
- Track over time as you iterate
🚀 Production Monitor (real traffic)
- Sample outputs, grade by hand
- Alert if quality dips >5%
- Have a rollback plan (previous prompt version)
A/B Test
- New prompt vs. old on 5–10% of traffic
- Measure delta in quality metric
- Deploy only if win is statistical & material
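"Statistical & material" can be checked with a two-proportion z-test, as in this sketch (the 5% lift threshold and 1.96 critical value are illustrative defaults, not recommendations from the text):

```python
from math import sqrt

def ab_win(wins_new: int, n_new: int, wins_old: int, n_old: int,
           min_lift: float = 0.05, z_crit: float = 1.96) -> bool:
    """Deploy only if the new prompt's win rate beats the old one by a
    material margin (min_lift) AND the difference is significant (z-test)."""
    p_new, p_old = wins_new / n_new, wins_old / n_old
    pooled = (wins_new + wins_old) / (n_new + n_old)
    se = sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se if se else 0.0
    return (p_new - p_old) >= min_lift and z >= z_crit
```

A 90% vs 70% win rate over 100 samples each passes both checks; a 72% vs 70% result fails the materiality check even before significance enters.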
Version Control Strategy
Store prompts as versioned files under source control (not as magic strings buried in application code). Example structure:
prompts/
├── sentiment_classifier/
│ ├── v1.0.txt (baseline: zero-shot)
│ ├── v1.1.txt (added 2 examples → +15% accuracy)
│ ├── v1.2.txt (changed wording → +2% accuracy)
│ └── v2.0.txt (current: few-shot + structured output)
├── code_reviewer/
│ ├── v1.0.txt
│ └── v1.1.txt
└── test_cases.jsonl (3-5 examples per task)
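A benchmark runner over a `test_cases.jsonl` file can be a few lines. A sketch, assuming one JSON object per line with `input` and `expected` keys (the file layout and the `predict` callable are assumptions; swap in a real model call):

```python
import json
from typing import Callable

def score_prompt(predict: Callable[[str], str], test_cases_path: str) -> float:
    """Run a prediction function over JSONL test cases; return exact-match accuracy."""
    correct = total = 0
    with open(test_cases_path) as f:
        for line in f:
            case = json.loads(line)  # assumed shape: {"input": ..., "expected": ...}
            total += 1
            if predict(case["input"]).strip() == case["expected"]:
                correct += 1
    return correct / total if total else 0.0
```

Run it once per prompt version and record the score next to the version file; the accuracy deltas in the tree above come from exactly this kind of comparison.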
A/B Testing Workflow
1. Write candidate prompt. Based on failure analysis, add examples or restructure.
2. Test on benchmark. Run on 20–50 test cases. Score. Compare to baseline.
3. If improvement >5%: Deploy to 5–10% of traffic for 1 week.
4. Monitor metrics. Manual spot-checks + automated scoring. Alert if quality dips.
5. If stable, roll out to 100%. Keep old prompt version as rollback. Commit to version control with date and A/B results.
✓ Best practice: Keep 2–3 versions of each prompt in production. If the current version breaks, roll back immediately instead of debugging live.
06 — Further Reading
References
Academic Papers
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Liu, P. et al. (2023). Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv:2107.13586.
Further Exploration
Prompt Chaining and Pipelines
Single prompts rarely solve complex tasks. Chaining breaks a hard problem into sequential sub-tasks, each with a focused prompt. The output of one call becomes the input to the next. Chains are deterministic (no branching) and easy to debug, making them the right default before reaching for agents.
Common chain patterns: Decompose → Solve → Aggregate (split a big question, answer each part, merge), Draft → Critique → Revise (generate then self-correct), and Extract → Transform → Format (parse raw input, process, structure output). DSPy and LangChain Expression Language both formalise chains, but plain Python function calls work just as well for short chains.
import openai

client = openai.OpenAI()

def call(prompt, system="You are a helpful assistant.", model="gpt-4o-mini"):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": prompt}],
    ).choices[0].message.content

# Chain: Decompose → Solve → Synthesise
def answer_complex(question: str) -> str:
    # Step 1: decompose into sub-questions
    sub_qs = call(
        f"Break this question into 3 focused sub-questions:\n{question}",
        system="Output only a numbered list.",
    )
    # Step 2: answer each sub-question
    answers = call(
        f"Answer each sub-question concisely:\n{sub_qs}",
        system="Be factual and brief.",
    )
    # Step 3: synthesise final answer
    return call(
        f"Original question: {question}\nSub-answers:\n{answers}\n"
        "Write a concise final answer.",
        system="Synthesise the sub-answers into a single, coherent response.",
    )

print(answer_complex("How should I choose between RAG and fine-tuning for a customer support bot?"))
| Pattern | Use when | Trade-off |
| --- | --- | --- |
| Sequential chain | Steps are dependent and linear | Latency adds up; errors propagate |
| Parallel fan-out | Sub-tasks are independent | Lower latency; needs merge step |
| Draft → critique | Quality matters more than speed | 2× LLM calls; usually worth it |
| Agent loop | Path is unknown at design time | Unpredictable steps; harder to debug |
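The parallel fan-out pattern can be sketched with a thread pool; here `call` is a stand-in for a real LLM client (the function body is a placeholder, not an API):

```python
from concurrent.futures import ThreadPoolExecutor

def call(prompt: str) -> str:
    # Stand-in for an LLM API call; replace with a real client invocation.
    return f"answer to: {prompt}"

def fan_out(sub_tasks: list[str]) -> list[str]:
    """Run independent sub-tasks concurrently; return results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(call, sub_tasks))  # map preserves input order

results = fan_out(["summarise doc A", "summarise doc B", "summarise doc C"])
merged = "\n".join(results)  # the merge step from the table above
```

Threads suffice because LLM calls are I/O-bound; the merge step is where the fan-out's latency saving is paid for with one extra aggregation prompt or a simple join.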