Prompt Engineering

Prompting LLMs

The primary interface between you and any LLM — master this before anything else

Few-shot > Instructions The #1 Rule
System + Examples + Task Prompt Anatomy
Test Before Deploy The Discipline

In This Section

01 — Foundation

The Anatomy of a Production-Grade Prompt

Prompting is the primary interface between you and an LLM. A well-designed prompt can make a smaller model outperform a larger one. Poor prompts make even the best models inconsistent. A production-grade prompt has four core components:

The Four Components

Component | Purpose | Example
System | Role, constraints, output format | "You are a senior Python engineer. Always use type hints. Output JSON."
Examples | Show the exact format and tone you want | 2–5 input/output pairs demonstrating the task
Task | Clear, specific, scoped instruction | "Refactor this function to use a dict instead of a loop"
Output Format | Explicit structure of the response | "Return JSON: {code, explanation, test_case}"

Not every prompt needs all four, but production systems use all of them. Examples are so powerful that 2–3 worked examples often beat 500 words of instruction.

Pro tip: The fastest quality improvement: add 2–3 worked examples. Models learn format and tone from examples faster than from instructions.

Prompt Structure in Practice

SYSTEM — role, constraints, format:
"You are a senior Python engineer. Always use type hints. Always add docstrings. Output JSON with keys: code, reason, time_complexity."

EXAMPLES — show the format you want:
Input: "Optimize this loop: for i in range(n): sum += arr[i]"
Output: {"code": "sum(arr)", "reason": "Built-in is faster", "time_complexity": "O(n) → O(n)"}
Input: "Add type hints to this function: def add(a, b): return a + b"
Output: {"code": "def add(a: int, b: int) -> int: return a + b", ...}

TASK — clear, specific, scoped:
"Refactor this function to use a dictionary for O(1) lookups instead of a list."

OUTPUT FORMAT — what you want back:
"Return JSON with keys: refactored_code, original_complexity, new_complexity, explanation"

This structure is discipline. It forces you to be explicit about what you want, which forces the model to be consistent in what it delivers.

02 — Techniques

Techniques Ranked by Impact

Not all prompting techniques are equal. Some give you 2–3x quality gains; others give 5–10% gains. Here's the ranking by impact:

1

Few-Shot Examples (Highest Leverage)

Give the model 2–5 worked examples of input→output. Models learn the format, tone, and implicit constraints from examples faster than from text instructions.

  • Quality gain: 2–5x on structured tasks
  • Why: LLMs are pattern matchers. Examples are patterns.
  • When to use: Always, for any production task
2

Chain-of-Thought (For Reasoning)

Ask the model to show its reasoning step-by-step before answering. "Think step by step" or "Reasoning:" forces the model to decompose the problem.

  • Quality gain: 10–30% on math, logic, multi-step reasoning
  • Why: Latent reasoning emerges when you force verbalization
  • When to use: Math, logic, causality, multi-hop problems
3

System Role (For Consistency)

Set the system prompt with a role: "You are a senior data scientist." The role shapes tone, vocabulary, and constraints without explicit instruction.

  • Quality gain: 5–15% on consistency and expertise
  • Why: Role primes a particular behavioral distribution
  • When to use: Any task where tone/expertise matters
4

Output Schema (For Reliability)

Explicitly define the output structure as JSON Schema or XML. "Return JSON with keys: {field1, field2}". Structured outputs are more parseable and predictable.

  • Quality gain: 5–10% on parsing reliability
  • Why: Reduces ambiguity in output format
  • When to use: Any task whose output is consumed downstream
5

Self-Consistency (For Accuracy)

Generate multiple independent completions and take the majority vote. Samples the distribution to find high-likelihood answers.

  • Quality gain: 5–15% on hard tasks, costs 3–5x
  • Why: Reduces variance of a single sample
  • When to use: High-stakes, reasoning-heavy tasks where cost permits
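Mechanically, self-consistency is just sampling plus a majority vote. A minimal sketch, where `ask` stands in for any prompt-to-text function (e.g., a wrapper around your API client called with temperature > 0 so samples actually differ):

```python
from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(ask: Callable[[str], str], prompt: str, n: int = 5) -> Tuple[str, float]:
    """Sample n independent completions, return (majority answer, agreement rate)."""
    answers = [ask(prompt).strip() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Illustration with a stubbed model (a real call would hit the API):
replies = iter(["42", "42", "41", "42", "42"])
answer, agreement = self_consistent_answer(lambda p: next(replies), "What is 6 * 7?", n=5)
# answer == "42", agreement == 0.8
```

A low agreement rate is itself a useful signal: it flags inputs where the model is uncertain and a human should review.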
ℹ️ Combination matters: Few-shot + Chain-of-Thought on a math task often outperforms either alone. Test combinations for your specific task.
03 — Debugging

Common Failure Modes and Fixes

Inconsistent Outputs

Problem: Same prompt gives wildly different answers on the same input. Fix: Add 2–3 examples showing the exact format. Examples anchor the output distribution.

Format Errors

Problem: Model returns free text when you asked for JSON. Fix: (1) Add an example in the target format. (2) Use your provider's structured-output features where available (e.g., forcing tool use with a JSON schema). (3) Repeat the format instruction in the task itself: "Return ONLY valid JSON, no markdown."
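These fixes can be backed by a parse-then-reask helper in code. A sketch, where `reask` is a hypothetical callable wrapping your API:

```python
import json

def parse_json_reply(text: str, reask=None, retries_left: int = 1):
    """Parse a model reply as JSON, stripping markdown fences first;
    if parsing fails, optionally re-ask once with a corrective prompt."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Models often wrap JSON in ```json ... ``` fences despite instructions.
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        if retries_left > 0 and reask is not None:
            retry_text = reask("Return ONLY valid JSON, no markdown.")
            return parse_json_reply(retry_text, reask, retries_left - 1)
        raise
```

Cap retries at one or two: if the model still returns invalid JSON after a corrective prompt, the fix belongs in the prompt, not the retry loop.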

Hallucination / Fabrication

Problem: Model invents facts or code that don't exist. Fix: (1) Add examples showing confident uncertainty. (2) Add "Say 'I don't know' if unsure." (3) Use retrieval-augmented generation (RAG). (4) Constrain the answer space: "Choose one of: A, B, C."
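Fix (4) can be enforced in code rather than trusted to the model. A minimal sketch, where `ask` is an assumed prompt-to-text helper:

```python
def constrained_answer(ask, question: str, allowed=("A", "B", "C"), retries: int = 2):
    """Re-ask until the reply is exactly one of the allowed options; None if it never is."""
    options = ", ".join(allowed)
    prompt = f"{question}\nAnswer with exactly one of: {options}. No other text."
    for _ in range(retries + 1):
        reply = ask(prompt).strip()
        if reply in allowed:
            return reply
    return None  # caller routes to a fallback (human review, default answer)
```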

Incomplete Responses

Problem: Model cuts off mid-response, especially for long outputs. Fix: (1) Increase max_tokens. (2) Break the task into smaller subtasks. (3) Use streaming to detect truncation early.
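Truncation is detectable in code: the Anthropic Messages API reports a stop_reason, and "max_tokens" means the reply was cut off. A sketch of fixes (1) and (3) combined (the doubling retry policy is an assumption, not a fixed recipe):

```python
def is_truncated(response) -> bool:
    # "max_tokens" means generation hit the token cap mid-response.
    return getattr(response, "stop_reason", None) == "max_tokens"

def complete_or_retry(make_request, max_tokens: int = 256, attempts: int = 3):
    """Double max_tokens and retry while the response is truncated.
    `make_request` is any max_tokens -> response callable wrapping your client."""
    for _ in range(attempts):
        response = make_request(max_tokens)
        if not is_truncated(response):
            return response
        max_tokens *= 2
    return response  # still truncated; caller decides what to do
```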

Off-Topic Tangents

Problem: Model answers something related but not what you asked. Fix: (1) Be more specific in the task statement. (2) Add a "focus" example that shows depth on the right topic. (3) Add a negative example: "Do NOT explain the history of..."

Tone Mismatch

Problem: Output is too formal, too casual, too verbose. Fix: Use the system role and examples. "You are a concise technical writer" + an example of concise output beats "Be concise."

⚠️ Debugging mindset: Treat prompt tuning like a science: change one thing at a time, test on 3–5 examples, measure the change. It's easy to make a prompt "better" on your favorite example while breaking others.
04 — Implementation

Working Code Examples

1. Zero-Shot Baseline

No examples. Just ask. Useful as a baseline to compare against. Usually the weakest approach on structured tasks.

from anthropic import Anthropic

client = Anthropic()

# Zero-shot: no examples, no structure
response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=64,
    messages=[{
        'role': 'user',
        'content': 'Classify sentiment: "The product broke on day 1."'
    }]
)
print(response.content[0].text)

2. Few-Shot Classification

Add 2–3 examples showing the format and boundaries. Watch the quality jump.

FEW_SHOT = """Classify sentiment (positive/negative/neutral): "Great value" -> positive "Average experience" -> neutral "Never buying again" -> negative "The product broke on day 1." ->""" response = client.messages.create( model='claude-haiku-4-5-20251001', max_tokens=8, messages=[{'role': 'user', 'content': FEW_SHOT}] ) print(response.content[0].text)

3. Chain-of-Thought (Reasoning)

For logic or math, ask the model to think step-by-step. This unlocks multi-step reasoning.

response = client.messages.create(
    model='claude-opus-4-5',
    max_tokens=512,
    messages=[{
        'role': 'user',
        'content': (
            'A store sells 3 apples for $1. I buy 12. '
            'Think step by step, then state the total cost.'
        )
    }]
)
print(response.content[0].text)

4. System Prompt Persona

Use the system parameter to set role and constraints. This shapes behavior more subtly than in-message instructions.

response = client.messages.create(
    model='claude-haiku-4-5-20251001',
    max_tokens=256,
    system=(
        'You are a senior Python engineer. '
        'Be concise. Always include code examples. '
        'Use type hints.'
    ),
    messages=[{
        'role': 'user',
        'content': 'How do I reverse a list?'
    }]
)
print(response.content[0].text)

5. Structured JSON Output

Combine few-shot + output schema for maximum reliability. State the schema twice: once in the system, once in the task.

system = """You are a code reviewer. Be critical but fair. Always respond with valid JSON: {"verdict": "pass|needs_revision|fail", "issues": [...], "suggestion": "..."}""" user_task = """Code to review: def add(a, b): return a + b Respond with JSON. No markdown.""" response = client.messages.create( model='claude-opus-4-5', max_tokens=256, system=system, messages=[{'role': 'user', 'content': user_task}] ) print(response.content[0].text)
05 — Discipline

Prompt Versioning and Testing

Prompts are code. Version them. Test them. Track their performance. A single poorly tuned prompt in production can silently degrade your entire product.

Three-Tier Testing

🧪 Unit Tests (3–5 examples)

  • Test edge cases: empty input, long input, special chars
  • Test boundaries: happy path + one failure case
  • Automate: run on every prompt change
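A minimal harness for this tier; `classify` stands in for whatever function wraps your prompt and API call (an assumption, not a fixed interface):

```python
# Each case: (input, expected label). Keep 3–5, including edge cases.
CASES = [
    ("Great value", "positive"),
    ("Never buying again", "negative"),
    ("", "neutral"),  # edge case: empty input
]

def run_prompt_tests(classify) -> list:
    """Return (input, expected, got) for every failing case; empty list means pass."""
    failures = []
    for text, expected in CASES:
        got = classify(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures
```

Wire this into CI so any prompt change that breaks a case fails the build before it reaches the benchmark tier.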

📊 Benchmark (20–50 examples)

  • Represents real traffic distribution
  • Measure: accuracy, latency, cost
  • Track over time as you iterate

🚀 Production Monitor (real traffic)

  • Sample outputs, grade by hand
  • Alert if quality dips >5%
  • Have a rollback plan (previous prompt version)

A/B Test

  • New prompt vs. old on 5–10% of traffic
  • Measure delta in quality metric
  • Deploy only if the win is statistically significant and material
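"Significant and material" can be checked with a one-sided two-proportion z-test. A sketch; the 1.645 cutoff (95%, one-sided) and the 5-point materiality floor are assumptions to adjust for your risk tolerance:

```python
import math

def ab_win(p_new: float, p_old: float, n_new: int, n_old: int, min_lift: float = 0.05) -> bool:
    """True if the new prompt's success rate beats the old one both
    statistically (one-sided z-test at 95%) and materially (>= min_lift)."""
    pooled = (p_new * n_new + p_old * n_old) / (n_new + n_old)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    return z > 1.645 and (p_new - p_old) >= min_lift
```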

Version Control Strategy

Store prompts as code (not as YAML magic strings). Example structure:

prompts/
├── sentiment_classifier/
│   ├── v1.0.txt   (baseline: zero-shot)
│   ├── v1.1.txt   (added 2 examples → +15% accuracy)
│   ├── v1.2.txt   (changed wording → +2% accuracy)
│   └── v2.0.txt   (current: few-shot + structured output)
├── code_reviewer/
│   ├── v1.0.txt
│   └── v1.1.txt
└── test_cases.jsonl   (3–5 examples per task)
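Loading then becomes trivial; a sketch assuming the layout above, with the version pinned in config so rollback is a one-line change:

```python
from pathlib import Path

def load_prompt(task: str, version: str, root: str = "prompts") -> str:
    """Read e.g. prompts/sentiment_classifier/v1.1.txt."""
    return (Path(root) / task / f"{version}.txt").read_text()
```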

A/B Testing Workflow

1
Write candidate prompt. Based on failure analysis, add examples or restructure.
2
Test on benchmark. Run on 20–50 test cases. Score. Compare to baseline.
3
If improvement >5%: Deploy to 5–10% of traffic for 1 week.
4
Monitor metrics. Manual spot-checks + automated scoring. Alert if quality dips.
5
If stable, roll out to 100%. Keep old prompt version as rollback. Commit to version control with date and A/B results.
Best practice: Keep 2–3 versions of each prompt in production. If the current version breaks, roll back immediately instead of debugging live.
06 — Progression

What to Explore Next

Prompting is foundational. Once you master basic techniques, explore these deeper concepts:

Child Concept
Basic Techniques
Zero-shot, few-shot, system prompts, role prompting foundations.
Child Concept
Advanced Reasoning
Chain-of-thought variants, Tree of Thoughts, ReAct, multi-step reasoning patterns.
Child Concept
Programmatic Prompting
DSPy, LMQL, Outlines — automatic prompt optimization and constraint satisfaction.
Child Concept
Output Control
JSON mode, constrained decoding, structured outputs, grammar-guided generation.

Learning path: Start with basic techniques + code examples. Move to advanced reasoning for complex tasks. Explore programmatic prompting for large-scale optimization. Use output control for production reliability.

07 — Further Reading

References

08 — Chaining

Prompt Chaining and Pipelines

Single prompts rarely solve complex tasks. Chaining breaks a hard problem into sequential sub-tasks, each with a focused prompt. The output of one call becomes the input to the next. Chains are deterministic (no branching) and easy to debug, making them the right default before reaching for agents.

Common chain patterns: Decompose → Solve → Aggregate (split a big question, answer each part, merge), Draft → Critique → Revise (generate then self-correct), and Extract → Transform → Format (parse raw input, process, structure output). DSPy and LangChain Expression Language both formalise chains, but plain Python function calls work just as well for short chains.

import openai

client = openai.OpenAI()

def call(prompt, system="You are a helpful assistant.", model="gpt-4o-mini"):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": prompt}],
    ).choices[0].message.content

# Chain: Decompose → Solve → Synthesise
def answer_complex(question: str) -> str:
    # Step 1: decompose into sub-questions
    sub_qs = call(
        f"Break this question into 3 focused sub-questions: {question}",
        system="Output only a numbered list.",
    )
    # Step 2: answer each sub-question
    answers = call(
        f"Answer each sub-question concisely: {sub_qs}",
        system="Be factual and brief.",
    )
    # Step 3: synthesise final answer
    final = call(
        f"Original question: {question} Sub-answers: {answers} Write a concise final answer.",
        system="Synthesise the sub-answers into a single, coherent response.",
    )
    return final

print(answer_complex("How should I choose between RAG and fine-tuning for a customer support bot?"))
Pattern | Use when | Trade-off
Sequential chain | Steps are dependent and linear | Latency adds up; errors propagate
Parallel fan-out | Sub-tasks are independent | Lower latency; needs merge step
Draft → critique | Quality matters more than speed | 2× LLM calls; usually worth it
Agent loop | Path is unknown at design time | Unpredictable steps; harder to debug
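The fan-out row can be sketched with a thread pool; any prompt-to-text callable works in place of a real API wrapper:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(call, prompts):
    """Run independent sub-prompts concurrently; results come back in input
    order, ready for a final merge/synthesis call."""
    workers = max(1, min(8, len(prompts)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call, prompts))
```

Threads are enough here because LLM calls are I/O-bound; the 8-worker cap is an assumption to keep under provider rate limits.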