Basic Prompting

Few-Shot Prompting

Include 2–5 input/output examples in your prompt to show the model exactly what you want. The fastest way to improve output quality without fine-tuning.

2–5 · Examples sweet spot
10× · Cheaper than fine-tuning
Format + style · What it teaches

SECTION 01

What few-shot is (and isn't)

Think of it like an employee's first day. You could write a 10-page style guide (that's a long system prompt), or you could hand them three emails and say "write like this." Few-shot is the second approach: show, don't just tell.

You include concrete input/output pairs directly in the prompt. The model doesn't learn or update; it infers the pattern from those examples and applies it to your new input. When the prompt window closes, the "learning" disappears. This is in-context learning, not training.

What examples teach: Format (JSON vs prose), style (formal vs casual), length, which details to include or omit, edge-case handling. They're most powerful for tasks where the "right" output is hard to specify in words but obvious from examples.
SECTION 02

Why examples beat instructions

Some things are almost impossible to describe but obvious to demonstrate. Try writing instructions that produce exactly this output format:

Input: "The meeting ran long and we didn't get to Q3 planning."
Output: ⚠️ RISK · Q3 planning · reschedule before month end

You could spend 50 words describing that format, or just show three examples. The model extracts the pattern instantly. This is where few-shot wins: implicit style, unusual formats, domain-specific conventions, and cases where your instructions would be longer than your examples.

When instructions are better: Clear rules ("always reply in French"), absolute constraints ("never include PII"), and factual requirements ("cite a source for every claim"). Instructions are more reliable for hard constraints; examples are more reliable for style and format.
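In practice the two combine: hard constraints go in the instruction, format and style go in the examples. A minimal sketch of that split (the instruction text and example pairs are illustrative, not from a real system):

```python
# Hard constraint lives in the instruction; format lives in the examples.
# All prompt text and example pairs below are illustrative assumptions.

EXAMPLES = [
    ("The meeting ran long and we didn't get to Q3 planning.",
     "RISK: Q3 planning slipped; reschedule before month end."),
    ("Budget approved, kickoff is Monday.",
     "OK: budget approved; kickoff Monday."),
]

def build_prompt(note: str) -> str:
    # The instruction carries the absolute rule ("never include..."),
    # which examples alone can't guarantee.
    instruction = "Summarize the note in one line. Never include names or emails."
    shots = "\n\n".join(f"Note: {i}\nSummary: {o}" for i, o in EXAMPLES)
    return f"{instruction}\n\n{shots}\n\nNote: {note}\nSummary:"

print(build_prompt("Vendor contract renewal is still unsigned."))
```

The model sees the rule stated once and the format demonstrated twice, which is usually more reliable than trying to carry both through either channel alone.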

SECTION 03

Choosing good examples

Bad examples hurt more than no examples. The model will replicate mistakes, unusual edge cases, or the wrong register if your examples contain them.

# What makes a good few-shot example
✓ Representative: covers the main case, not a weird edge case
✓ Correct: the output you'd actually ship to a user
✓ Consistent: same format, same level of detail across all examples
✓ Diverse (for classification): each example from a different class
✗ Avoid: borderline cases, your longest/shortest outputs, anything you'd hesitate to show a new hire
# Example: sentiment classification (good few-shot)
User: Classify the sentiment.

Examples:
Input: "The delivery was two days late and the packaging was crushed."
Output: NEGATIVE

Input: "Exactly what I ordered, arrived fast."
Output: POSITIVE

Input: "It's fine I guess, does what it says."
Output: NEUTRAL

Now classify:
Input: "I've reordered three times; can't imagine using anything else."
Output:
SECTION 04

The right number of shots

Shots | When to use | Trade-off
1 (one-shot) | Format hint; model already knows the task well | Low cost, low format lock-in
2–3 | Most tasks. Enough for the pattern, not too much context | Sweet spot for quality vs token cost
4–5 | Nuanced classification, unusual formats | Higher quality, more tokens
6–10+ | Rarely needed. Complex multi-label classification | Diminishing returns; approaching fine-tuning territory

Order matters: Put your most representative example last; the model weights recent examples more. If you have one "ideal" example, make it the final one before the task.

SECTION 05

Few-shot vs fine-tuning

This is the most common decision point. The first rule of thumb is economic:

Cost math: GPT-4 few-shot at 1000 req/day with 5 examples ≈ $15/day in extra tokens. Fine-tuning GPT-3.5 on 1000 examples ≈ $1 one-time. At scale, fine-tuning pays off fast, but only if quality is actually better.
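The $15/day figure above falls out of simple arithmetic. A back-of-envelope helper, with the per-example token count and the per-token rate as assumed illustrative values:

```python
# Back-of-envelope cost of prepending few-shot examples to every request.
# tokens_per_example and usd_per_1k_input_tokens are assumptions for
# illustration; plug in your model's real pricing.

def daily_extra_cost(requests_per_day: int, examples: int,
                     tokens_per_example: int = 150,
                     usd_per_1k_input_tokens: float = 0.02) -> float:
    """Extra daily spend from the example tokens alone."""
    extra_tokens = requests_per_day * examples * tokens_per_example
    return extra_tokens / 1000 * usd_per_1k_input_tokens

# 1000 req/day, 5 examples of ~150 tokens each -> about $15/day
print(round(daily_extra_cost(1000, 5), 2))
```

Because the example tokens are paid on every single request, the cost scales linearly with traffic, which is exactly why fine-tuning's one-time cost wins at volume.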
SECTION 06

Code patterns

import anthropic

client = anthropic.Anthropic()

# Pattern 1: Static few-shot examples inline
EXAMPLES = [
    ("The refund took 3 weeks", "NEGATIVE"),
    ("Solid product, good price", "POSITIVE"),
    ("Nothing special but works", "NEUTRAL"),
]

def classify(review: str) -> str:
    examples_text = "\n\n".join(
        f"Review: {inp}\nSentiment: {out}" for inp, out in EXAMPLES
    )
    prompt = f"{examples_text}\n\nReview: {review}\nSentiment:"
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text.strip()

# Pattern 2: Dynamic examples retrieved by similarity
# (select examples most similar to the current input)
def classify_with_dynamic_examples(review: str, example_pool: list) -> str:
    # Sort example_pool by embedding similarity to review
    # (use your embedding model here)
    top_k = retrieve_similar(review, example_pool, k=3)
    examples_text = "\n\n".join(
        f"Review: {ex['input']}\nSentiment: {ex['output']}" for ex in top_k
    )
    prompt = f"{examples_text}\n\nReview: {review}\nSentiment:"
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text.strip()
Dynamic few-shot: Retrieving examples by similarity to the current input (rather than using fixed examples) often outperforms static few-shot by 10–20%. It's the right default once you have an example pool larger than 10.

Dynamic few-shot example selection

Static few-shot examples chosen once at prompt design time perform significantly worse than dynamically selected examples that are semantically similar to each incoming query. Retrieving the most similar examples from a labeled example store using embedding similarity (the same mechanism as RAG document retrieval) provides examples that demonstrate the output format on inputs resembling the current query. Studies on few-shot performance consistently show that semantically relevant examples improve accuracy by 10–30 percentage points compared to random or hand-picked static examples, particularly for tasks with diverse input distributions where no single static set of examples covers the full input space.
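The selection step is just a nearest-neighbor ranking over the example pool. A self-contained sketch, with a toy bag-of-words "embedding" standing in for a real embedding model (swap in an actual encoder in practice; the pool contents are illustrative):

```python
# Dynamic example selection by cosine similarity. The Counter-based
# embed() is a stand-in assumption; use a real embedding model in practice.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy embedding: word-count vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query: str, pool: list, k: int = 3) -> list:
    """Return the k pool examples most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return ranked[:k]

pool = [
    {"input": "The refund took 3 weeks", "output": "NEGATIVE"},
    {"input": "Solid product, good price", "output": "POSITIVE"},
    {"input": "Nothing special but works", "output": "NEUTRAL"},
    {"input": "Refund still not processed after a month", "output": "NEGATIVE"},
]
top = select_examples("the refund never arrived", pool, k=2)
```

For a refund complaint, both retrieved examples are the refund-related NEGATIVE ones, so the prompt demonstrates the format on inputs that resemble the query.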

Few-shot example quality criteria

Not all labeled examples are equal as few-shot demonstrations. High-quality few-shot examples are unambiguous โ€” the correct output is clearly derivable from the input without domain expertise. They cover edge cases and difficult patterns rather than only easy, representative examples. They demonstrate consistent formatting with no variation in structure or style. And they are diverse enough that together they cover the input space the model will encounter. Curating examples to these criteria rather than randomly sampling from a labeled dataset typically produces larger accuracy improvements per example added.

Example quality factor | Impact | How to assess
Semantic similarity to query | High | Cosine similarity with test query embedding
Label correctness | Critical | Human review or LLM-as-judge audit
Output format consistency | Medium | Regex/schema validation across examples
Diversity of input coverage | Medium | Clustering examples by embedding
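The format-consistency check in the table is mechanical enough to automate. A sketch that validates every example output against one regex before the set is used (the label schema and example data are illustrative assumptions):

```python
# "Output format consistency" check: flag any example whose output
# does not match the expected label schema. Pattern is an assumption.
import re

LABEL_PATTERN = re.compile(r"^(POSITIVE|NEGATIVE|NEUTRAL)$")

def inconsistent_examples(examples: list) -> list:
    """Return the outputs that fail the expected label format."""
    return [out for _, out in examples if not LABEL_PATTERN.match(out)]

examples = [
    ("Arrived fast", "POSITIVE"),
    ("Broken on arrival", "NEGATIVE"),
    ("It was okay.", "Neutral"),   # wrong case: gets flagged
]
print(inconsistent_examples(examples))  # ['Neutral']
```

Running this once over the pool catches the subtle drift (case, punctuation, extra words) that the model would otherwise happily replicate.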

Few-shot ordering effects (different orderings of the same examples producing significantly different accuracy) are a documented property of in-context learning that requires explicit management. The last example in the few-shot list tends to have disproportionate influence on the model's next output, a recency bias effect. Placing the most typical or representative example last and the edge-case examples earlier mitigates this bias. When example order cannot be controlled (as in dynamic example retrieval), averaging predictions over multiple orderings provides more stable accuracy estimates but multiplies inference cost by the number of orderings.
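The ordering mitigation described above amounts to one small list operation. A sketch (the example data is illustrative):

```python
# Recency-bias mitigation: move the most representative example to the
# end of the shot list, where its influence on the next output is largest.

def order_examples(examples: list, representative_index: int) -> list:
    """Place the chosen example last; keep the rest in original order."""
    rest = [ex for i, ex in enumerate(examples) if i != representative_index]
    return rest + [examples[representative_index]]

shots = [
    {"input": "edge case A", "output": "X"},
    {"input": "typical case", "output": "Y"},   # most representative
    {"input": "edge case B", "output": "Z"},
]
ordered = order_examples(shots, representative_index=1)
# ordered: edge case A, edge case B, typical case
```

With dynamic retrieval, the same idea applies: after selecting the top-k examples, sort them so the most similar one comes last rather than first.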

Few-shot calibration (the tendency of few-shot models to assign output class probabilities proportional to class frequency in the examples rather than in the true distribution) is a systematic bias that degrades classification accuracy. A few-shot classifier with 3 positive and 1 negative example will assign higher prior probability to positive outputs regardless of the input, producing overconfident positive predictions. Calibration techniques like contextual calibration, which estimates and subtracts the base rate bias from few-shot logits, significantly improve classification accuracy in imbalanced scenarios without requiring additional labeled data.
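A toy sketch of the contextual-calibration idea: measure the model's class probabilities on a content-free input (e.g. "N/A"), treat that as the prompt-induced bias, and divide it out. The probability numbers below are illustrative, not real model outputs:

```python
# Contextual calibration sketch: divide out the base-rate bias measured
# on a content-free input, then renormalize. All numbers are illustrative.

def calibrate(probs: dict, content_free_probs: dict) -> dict:
    """Correct class probabilities using the bias measured on 'N/A'."""
    adjusted = {c: probs[c] / content_free_probs[c] for c in probs}
    total = sum(adjusted.values())
    return {c: v / total for c, v in adjusted.items()}

# Skewed prompt (3 positive, 1 negative example) biases toward POSITIVE:
raw = {"POSITIVE": 0.60, "NEGATIVE": 0.40}      # probs on a real input
bias = {"POSITIVE": 0.75, "NEGATIVE": 0.25}     # probs on input "N/A"
print(calibrate(raw, bias))
```

After calibration the NEGATIVE class wins here: the raw 60/40 split was weaker than the 75/25 bias the skewed prompt induces on an empty input, so the evidence actually points the other way.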

Chain-of-thought few-shot examples that include explicit reasoning traces produce significantly higher accuracy on complex reasoning tasks than examples with direct input-output pairs. The reasoning trace in the examples implicitly teaches the model the problem-solving strategy (how to decompose the problem, what intermediate quantities to compute, and how to check the answer) rather than just the input-output mapping. For arithmetic, logical reasoning, and commonsense inference tasks, few-shot CoT examples with 3–5 demonstrations routinely outperform 10+ direct-answer examples, making example quality (reasoning traces) more important than example count.
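Structurally, a CoT few-shot prompt just adds a reasoning field between each question and answer, and ends mid-trace so the model continues the reasoning. A sketch with an illustrative example:

```python
# CoT few-shot prompt builder: each demonstration carries an explicit
# reasoning trace. Example content is illustrative.

COT_EXAMPLES = [
    {
        "question": "A pack has 12 pens. I buy 3 packs and give away 5 pens. "
                    "How many are left?",
        "reasoning": "3 packs x 12 pens = 36 pens. 36 - 5 = 31.",
        "answer": "31",
    },
]

def build_cot_prompt(question: str) -> str:
    parts = [
        f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}"
        for ex in COT_EXAMPLES
    ]
    # End with an open "Reasoning:" so the model reasons before answering.
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)
```

The key detail is the trailing "Reasoning:": it forces the model to emit intermediate steps before committing to an answer, matching the demonstrated pattern.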

Negative examples in few-shot prompts (demonstrating incorrect outputs alongside correct outputs, with explicit labeling) can improve boundary learning for classification and extraction tasks. Showing the model what not to produce (irrelevant extractions, hallucinated entities, over-specified answers) alongside positive examples provides the contrastive signal needed to define the task boundary precisely. The effectiveness of negative examples varies by task: tasks with ambiguous boundaries (like relevance classification) benefit more from negative examples than tasks with clear criteria (like structured data extraction where schema compliance is objectively verifiable).

Label leakage in few-shot examples occurs when the example selection process accidentally includes examples similar to the test queries, inflating measured few-shot performance. When evaluating few-shot accuracy on a test set, examples must be drawn from a separate labeled pool with no overlap with the test distribution. For dynamic example selection using semantic similarity retrieval, ensuring that the example pool and the test set were drawn from different time periods or different data sources provides a clean separation that prevents leakage from artificially high similarity between selected examples and test queries.
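The time-based separation described above is a one-function split. A sketch (the record field names are assumptions):

```python
# Leakage-safe split: few-shot example pool from before a cutoff date,
# test queries from on/after it. The "created" field name is an assumption.
from datetime import date

def split_by_date(records: list, cutoff: date) -> tuple:
    """Return (example_pool, test_set) with no temporal overlap."""
    pool = [r for r in records if r["created"] < cutoff]
    test = [r for r in records if r["created"] >= cutoff]
    return pool, test

records = [
    {"input": "a", "created": date(2024, 1, 10)},
    {"input": "b", "created": date(2024, 6, 1)},
]
pool, test = split_by_date(records, cutoff=date(2024, 3, 1))
```

Because dynamic retrieval actively hunts for the nearest neighbors of each test query, even a few shared records between pool and test can inflate measured accuracy far more than random overlap would, which is why the hard split matters.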