Include 2–5 input/output examples in your prompt to show the model exactly what you want. It's the fastest way to improve output quality without fine-tuning.
Think of it like an employee's first day. You could write a 10-page style guide (that's a long system prompt), or you could hand them three emails and say "write like this." Few-shot is the second approach: show, don't just tell.
You include concrete input/output pairs directly in the prompt. The model doesn't learn or update anything; it infers the pattern from those examples and applies it to your new input. When the prompt window closes, the "learning" disappears. This is in-context learning, not training.
Some things are almost impossible to describe but obvious to demonstrate. Try writing instructions that produce exactly this output format:
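Say the target is a terse ticket-triage convention like this one (invented here purely for illustration):

```
Input: Checkout returns a 503 for some EU users
Output: [SEV-2 | checkout | infra] "503 on EU checkout" → page on-call

Input: Typo in the pricing page footer
Output: [SEV-4 | pricing | web] "footer typo" → backlog

Input: Password reset email takes 20 minutes to arrive
Output: [SEV-3 | auth | email] "slow reset email" → this sprint
```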
You could spend 50 words describing that format, or just show three examples. The model extracts the pattern instantly. This is where few-shot wins: implicit style, unusual formats, domain-specific conventions, and cases where your instructions would be longer than your examples.
When instructions are better: Clear rules ("always reply in French"), absolute constraints ("never include PII"), and factual requirements ("cite a source for every claim"). Instructions are more reliable for hard constraints; examples are more reliable for style and format.
Bad examples hurt more than no examples. The model will replicate mistakes, unusual edge cases, or the wrong register if your examples contain them.
| Shots | When to use | Trade-off |
|---|---|---|
| 1 (one-shot) | Format hint; the model already knows the task well | Low cost, low format lock-in |
| 2–3 | Most tasks: enough to establish the pattern without bloating context | Sweet spot for quality vs. token cost |
| 4–5 | Nuanced classification, unusual formats | Higher quality, more tokens |
| 6–10+ | Rarely needed; complex multi-label classification | Diminishing returns; approaching fine-tuning territory |
Order matters: Put your most representative example last, because the model weights recent examples more heavily. If you have one "ideal" example, make it the final one before the task.
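A minimal sketch of that ordering rule, assuming examples are plain (input, output) tuples (the helper name is my own, not a standard API):

```python
def build_fewshot_prompt(examples, best_example, query):
    """Assemble a few-shot prompt, placing the most representative
    example last so recency bias works in its favor."""
    ordered = [ex for ex in examples if ex != best_example] + [best_example]
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in ordered)
    return f"{shots}\n\nInput: {query}\nOutput:"
```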
This is the most common decision point. A few rules of thumb:
Static few-shot examples chosen once at prompt design time perform significantly worse than dynamically selected examples that are semantically similar to each incoming query. Retrieving the most similar examples from a labeled example store using embedding similarity (the same mechanism as RAG document retrieval) provides examples that demonstrate the output format on inputs resembling the current query. Studies on few-shot performance consistently show that semantically relevant examples improve accuracy by 10–30 percentage points compared to random or hand-picked static examples, particularly for tasks with diverse input distributions where no single static set of examples covers the full input space.
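A sketch of similarity-based selection, assuming you already have embedding vectors for the query and the example pool (the embedding model itself is out of scope here):

```python
import numpy as np

def top_k_examples(query_emb, example_embs, examples, k=3):
    """Pick the k examples whose embeddings are most cosine-similar
    to the query embedding."""
    q = np.asarray(query_emb, dtype=float)
    E = np.asarray(example_embs, dtype=float)
    q = q / np.linalg.norm(q)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ q                       # cosine similarity per example
    best = np.argsort(sims)[::-1][:k]  # indices, most similar first
    return [examples[i] for i in best]
```

At request time you embed the incoming query, retrieve the top-k labeled examples, and splice them into the prompt exactly as you would static ones.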
Not all labeled examples are equal as few-shot demonstrations. High-quality few-shot examples are unambiguous: the correct output is clearly derivable from the input without domain expertise. They cover edge cases and difficult patterns rather than only easy, representative examples. They demonstrate consistent formatting with no variation in structure or style. And they are diverse enough that together they cover the input space the model will encounter. Curating examples to these criteria rather than randomly sampling from a labeled dataset typically produces larger accuracy improvements per example added.
| Example quality factor | Impact | How to assess |
|---|---|---|
| Semantic similarity to query | High | Cosine similarity with test query embedding |
| Label correctness | Critical | Human review or LLM-as-judge audit |
| Output format consistency | Medium | Regex/schema validation across examples |
| Diversity of input coverage | Medium | Clustering examples by embedding |
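The format-consistency row is the easiest to automate; a small sketch, using a hypothetical bracket-prefixed label format as the expected pattern:

```python
import re

def inconsistent_outputs(outputs, pattern):
    """Return example outputs that don't match the expected format regex,
    so they can be fixed or dropped before they teach the model bad habits."""
    rx = re.compile(pattern)
    return [o for o in outputs if not rx.fullmatch(o)]
```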
Few-shot ordering effects (different orderings of the same examples producing significantly different accuracy) are a documented property of in-context learning that requires explicit management. The last example in the few-shot list tends to have disproportionate influence on the model's next output, a recency-bias effect. Placing the most typical or representative example last and the edge-case examples earlier mitigates this bias. When example order cannot be controlled (as in dynamic example retrieval), averaging predictions over multiple orderings gives more stable accuracy, but multiplies inference cost by the number of orderings sampled.
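A sketch of order averaging, with `classify` standing in for whatever model call returns per-class probabilities (it's stubbed out in the test harness, not a real API):

```python
from itertools import permutations

def predict_order_averaged(classify, examples, query, max_orders=6):
    """Average class probabilities across several example orderings to
    smooth out recency-bias effects. Cost scales linearly with the
    number of orderings actually run."""
    orders = list(permutations(examples))[:max_orders]
    totals = {}
    for order in orders:
        for label, p in classify(list(order), query).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: p / len(orders) for label, p in totals.items()}
```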
Few-shot calibration (the tendency of few-shot prompted models to assign output class probabilities proportional to class frequency in the examples rather than in the true distribution) is a systematic bias that degrades classification accuracy. A few-shot classifier with 3 positive and 1 negative example will assign higher prior probability to positive outputs regardless of the input, producing overconfident positive predictions. Calibration techniques like contextual calibration, which estimates the base-rate bias from a content-free input and subtracts it from the few-shot logits, significantly improve classification accuracy in imbalanced scenarios without requiring additional labeled data.
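The published method operates on logits; the same idea expressed in probability space looks roughly like this (a simplification of the technique, not the exact algorithm):

```python
def calibrate(probs, content_free_probs):
    """Divide each class probability by what the model assigns that class
    on a content-free input (e.g. "N/A"), then renormalize. This cancels
    the prior that imbalanced few-shot examples baked in."""
    adjusted = {c: p / max(content_free_probs[c], 1e-9) for c, p in probs.items()}
    z = sum(adjusted.values())
    return {c: p / z for c, p in adjusted.items()}
```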
Chain-of-thought few-shot examples that include explicit reasoning traces produce significantly higher accuracy on complex reasoning tasks than examples with direct input-output pairs. The reasoning trace in the examples implicitly teaches the model the problem-solving strategy (how to decompose the problem, what intermediate quantities to compute, and how to check the answer) rather than just the input-output mapping. For arithmetic, logical reasoning, and commonsense inference tasks, few-shot CoT examples with 3–5 demonstrations routinely outperform 10+ direct-answer examples, making the example quality (reasoning traces) more important than the example count.
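Structurally, a CoT demonstration just inserts the trace between question and answer; a minimal sketch (the field labels are an illustrative convention, not a standard):

```python
def cot_shot(question, reasoning, answer):
    """One chain-of-thought demonstration: the reasoning trace precedes
    the final answer so the model imitates the strategy, not just the
    input-output mapping."""
    return f"Q: {question}\nReasoning: {reasoning}\nA: {answer}"
```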
Negative examples in few-shot prompts (demonstrating incorrect outputs alongside correct ones, with explicit labeling) can improve boundary learning for classification and extraction tasks. Showing the model what not to produce (irrelevant extractions, hallucinated entities, over-specified answers) alongside positive examples provides the contrastive signal needed to define the task boundary precisely. The effectiveness of negative examples varies by task: tasks with ambiguous boundaries (like relevance classification) benefit more from negative examples than tasks with clear criteria (like structured data extraction where schema compliance is objectively verifiable).
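One way to write such a contrastive demonstration (the labels here are my own wording, not a standard convention):

```python
def contrastive_shot(text, correct, incorrect):
    """Pair a correct output with an explicitly labeled incorrect one,
    giving the model a contrastive signal about the task boundary."""
    return (f"Input: {text}\n"
            f"Correct output: {correct}\n"
            f"Incorrect output (never produce this): {incorrect}")
```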
Label leakage in few-shot examples occurs when the example selection process accidentally includes examples similar to the test queries, inflating measured few-shot performance. When evaluating few-shot accuracy on a test set, examples must be drawn from a separate labeled pool with no overlap with the test distribution. For dynamic example selection using semantic similarity retrieval, ensuring that the example pool and the test set were drawn from different time periods or different data sources provides a clean separation that prevents leakage from artificially high similarity between selected examples and test queries.
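A sketch of that time-based separation, with records as (timestamp, input, label) tuples (a shape assumed here for illustration):

```python
def split_by_time(records, cutoff):
    """Everything before the cutoff becomes the retrievable example pool;
    everything at or after it is held out for testing, so similarity
    retrieval can never surface a near-duplicate of a test query."""
    pool = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return pool, test
```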