Include 2–5 input/output examples in your prompt to show the model exactly what you want. It's the fastest way to improve output quality without fine-tuning.
Think of it like an employee's first day. You could write a 10-page style guide (that's a long system prompt), or you could hand them three emails and say "write like this." Few-shot is the second approach: show, don't just tell.
You include concrete input/output pairs directly in the prompt. The model doesn't learn or update anything; it infers the pattern from those examples and applies it to your new input. When the prompt window closes, the "learning" disappears. This is in-context learning, not training.
Some things are almost impossible to describe but obvious to demonstrate. Try writing instructions that produce exactly this output format:
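Say the target is a terse ticket-triage convention like this one (invented here purely for illustration):

```
Input: Checkout returns a 503 for some EU users
Output: [SEV-2 | checkout | infra] "503 on EU checkout" → page on-call

Input: Typo in the pricing page footer
Output: [SEV-4 | pricing | web] "footer typo" → backlog

Input: Password reset email takes 20 minutes to arrive
Output: [SEV-3 | auth | email] "slow reset email" → this sprint
```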
You could spend 50 words describing that format, or just show three examples. The model extracts the pattern instantly. This is where few-shot wins: implicit style, unusual formats, domain-specific conventions, and cases where your instructions would be longer than your examples.
When instructions are better: Clear rules ("always reply in French"), absolute constraints ("never include PII"), and factual requirements ("cite a source for every claim"). Instructions are more reliable for hard constraints; examples are more reliable for style and format.
Bad examples hurt more than no examples. The model will replicate mistakes, unusual edge cases, or the wrong register if your examples contain them.
| Shots | When to use | Trade-off |
|---|---|---|
| 1 (one-shot) | Format hint; the model already knows the task well | Low cost, low format lock-in |
| 2–3 | Most tasks: enough to establish the pattern without bloating context | Sweet spot for quality vs. token cost |
| 4–5 | Nuanced classification, unusual formats | Higher quality, more tokens |
| 6–10+ | Rarely needed; complex multi-label classification | Diminishing returns; approaching fine-tuning territory |
Order matters: Put your most representative example last, because the model weights recent examples more heavily. If you have one "ideal" example, make it the final one before the task.
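A minimal sketch of that ordering rule, assuming examples are plain (input, output) tuples (the helper name is my own, not a standard API):

```python
def build_fewshot_prompt(examples, best_example, query):
    """Assemble a few-shot prompt, placing the most representative
    example last so recency bias works in its favor."""
    ordered = [ex for ex in examples if ex != best_example] + [best_example]
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in ordered)
    return f"{shots}\n\nInput: {query}\nOutput:"
```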
This is the most common decision point. A few rules of thumb:
Static few-shot examples chosen once at prompt design time perform significantly worse than dynamically selected examples that are semantically similar to each incoming query. Retrieving the most similar examples from a labeled example store using embedding similarity (the same mechanism as RAG document retrieval) provides examples that demonstrate the output format on inputs resembling the current query. Studies on few-shot performance consistently show that semantically relevant examples improve accuracy by 10–30 percentage points compared to random or hand-picked static examples, particularly for tasks with diverse input distributions where no single static set of examples covers the full input space.
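A sketch of similarity-based selection, assuming you already have embedding vectors for the query and the example pool (the embedding model itself is out of scope here):

```python
import numpy as np

def top_k_examples(query_emb, example_embs, examples, k=3):
    """Pick the k examples whose embeddings are most cosine-similar
    to the query embedding."""
    q = np.asarray(query_emb, dtype=float)
    E = np.asarray(example_embs, dtype=float)
    q = q / np.linalg.norm(q)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ q                       # cosine similarity per example
    best = np.argsort(sims)[::-1][:k]  # indices, most similar first
    return [examples[i] for i in best]
```

At request time you embed the incoming query, retrieve the top-k labeled examples, and splice them into the prompt exactly as you would static ones.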
Not all labeled examples are equal as few-shot demonstrations. High-quality few-shot examples are unambiguous: the correct output is clearly derivable from the input without domain expertise. They cover edge cases and difficult patterns rather than only easy, representative examples. They demonstrate consistent formatting with no variation in structure or style. And they are diverse enough that together they cover the input space the model will encounter. Curating examples to these criteria rather than randomly sampling from a labeled dataset typically produces larger accuracy improvements per example added.
| Example quality factor | Impact | How to assess |
|---|---|---|
| Semantic similarity to query | High | Cosine similarity with test query embedding |
| Label correctness | Critical | Human review or LLM-as-judge audit |
| Output format consistency | Medium | Regex/schema validation across examples |
| Diversity of input coverage | Medium | Clustering examples by embedding |
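The format-consistency row is the easiest to automate; a small sketch, using a hypothetical bracket-prefixed label format as the expected pattern:

```python
import re

def inconsistent_outputs(outputs, pattern):
    """Return example outputs that don't match the expected format regex,
    so they can be fixed or dropped before they teach the model bad habits."""
    rx = re.compile(pattern)
    return [o for o in outputs if not rx.fullmatch(o)]
```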
Few-shot ordering effects (different orderings of the same examples producing significantly different accuracy) are a documented property of in-context learning that requires explicit management. The last example in the few-shot list tends to have disproportionate influence on the model's next output, a recency-bias effect. Placing the most typical or representative example last and the edge-case examples earlier mitigates this bias. When example order cannot be controlled (as in dynamic example retrieval), averaging predictions over multiple orderings gives more stable accuracy, but multiplies inference cost by the number of orderings sampled.
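A sketch of order averaging, with `classify` standing in for whatever model call returns per-class probabilities (it's stubbed out in the test harness, not a real API):

```python
from itertools import permutations

def predict_order_averaged(classify, examples, query, max_orders=6):
    """Average class probabilities across several example orderings to
    smooth out recency-bias effects. Cost scales linearly with the
    number of orderings actually run."""
    orders = list(permutations(examples))[:max_orders]
    totals = {}
    for order in orders:
        for label, p in classify(list(order), query).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: p / len(orders) for label, p in totals.items()}
```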
Few-shot calibration (the tendency of few-shot prompted models to assign output class probabilities proportional to class frequency in the examples rather than in the true distribution) is a systematic bias that degrades classification accuracy. A few-shot classifier with 3 positive and 1 negative example will assign higher prior probability to positive outputs regardless of the input, producing overconfident positive predictions. Calibration techniques like contextual calibration, which estimates the base-rate bias from a content-free input and subtracts it from the few-shot logits, significantly improve classification accuracy in imbalanced scenarios without requiring additional labeled data.
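The published method operates on logits; the same idea expressed in probability space looks roughly like this (a simplification of the technique, not the exact algorithm):

```python
def calibrate(probs, content_free_probs):
    """Divide each class probability by what the model assigns that class
    on a content-free input (e.g. "N/A"), then renormalize. This cancels
    the prior that imbalanced few-shot examples baked in."""
    adjusted = {c: p / max(content_free_probs[c], 1e-9) for c, p in probs.items()}
    z = sum(adjusted.values())
    return {c: p / z for c, p in adjusted.items()}
```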
Chain-of-thought few-shot examples that include explicit reasoning traces produce significantly higher accuracy on complex reasoning tasks than examples with direct input-output pairs. The reasoning trace in the examples implicitly teaches the model the problem-solving strategy (how to decompose the problem, what intermediate quantities to compute, and how to check the answer) rather than just the input-output mapping. For arithmetic, logical reasoning, and commonsense inference tasks, few-shot CoT examples with 3–5 demonstrations routinely outperform 10+ direct-answer examples, making the example quality (reasoning traces) more important than the example count.
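Structurally, a CoT demonstration just inserts the trace between question and answer; a minimal sketch (the field labels are an illustrative convention, not a standard):

```python
def cot_shot(question, reasoning, answer):
    """One chain-of-thought demonstration: the reasoning trace precedes
    the final answer so the model imitates the strategy, not just the
    input-output mapping."""
    return f"Q: {question}\nReasoning: {reasoning}\nA: {answer}"
```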
Negative examples in few-shot prompts (demonstrating incorrect outputs alongside correct ones, with explicit labeling) can improve boundary learning for classification and extraction tasks. Showing the model what not to produce (irrelevant extractions, hallucinated entities, over-specified answers) alongside positive examples provides the contrastive signal needed to define the task boundary precisely. The effectiveness of negative examples varies by task: tasks with ambiguous boundaries (like relevance classification) benefit more from negative examples than tasks with clear criteria (like structured data extraction where schema compliance is objectively verifiable).
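One way to write such a contrastive demonstration (the labels here are my own wording, not a standard convention):

```python
def contrastive_shot(text, correct, incorrect):
    """Pair a correct output with an explicitly labeled incorrect one,
    giving the model a contrastive signal about the task boundary."""
    return (f"Input: {text}\n"
            f"Correct output: {correct}\n"
            f"Incorrect output (never produce this): {incorrect}")
```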
Label leakage in few-shot examples occurs when the example selection process accidentally includes examples similar to the test queries, inflating measured few-shot performance. When evaluating few-shot accuracy on a test set, examples must be drawn from a separate labeled pool with no overlap with the test distribution. For dynamic example selection using semantic similarity retrieval, ensuring that the example pool and the test set were drawn from different time periods or different data sources provides a clean separation that prevents leakage from artificially high similarity between selected examples and test queries.
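A sketch of that time-based separation, with records as (timestamp, input, label) tuples (a shape assumed here for illustration):

```python
def split_by_time(records, cutoff):
    """Everything before the cutoff becomes the retrievable example pool;
    everything at or after it is held out for testing, so similarity
    retrieval can never surface a near-duplicate of a test query."""
    pool = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return pool, test
```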