Few-shot prompting: teach the model with examples
Few-shot prompting means showing the model a handful of input/output examples before the real task. Learn how to pick examples, format them, and avoid the most common pitfalls.
You spent 30 minutes writing instructions. "Output a JSON object. Use snake_case keys. Format dates as ISO 8601. If a field is missing, use null. Don't add commentary. Capitalize proper nouns but lowercase generic terms." The model gets it 75% right and fails the remaining 25% in unpredictable, frustrating ways.
Now you delete the instructions and replace them with three actual examples — input, then output, three times. Suddenly the model nails the format every single time. Same task. Less effort. Better results.
That's the magic of few-shot prompting. Show the model what you want instead of describing it. Once you internalize it, you start seeing opportunities for it everywhere.
The whole idea in one line
Don't describe the pattern in words. Demonstrate it, and let the model imitate.
The mental model: demonstration as specification
Instructions try to describe a pattern in words. Examples are the pattern. The model is far better at imitating a demonstrated pattern than at interpreting a verbal description of one — for the same reason humans learn cooking faster from watching someone than from reading a textbook.
Few-shot prompting exploits a property of modern LLMs called in-context learning — the ability to learn a task from examples in the prompt, without any weight updates. You're effectively fine-tuning the model for the duration of one call. Free, instant, ephemeral.
Anatomy of a few-shot prompt
Three parts in this exact order:
- Instruction (optional but useful) — one or two sentences setting the task.
- Examples — 2 to 5 input/output pairs in identical format.
- The real input — the same shape as the examples, with the output left blank for the model to fill in.
```
Classify each email into exactly one category:
support, billing, sales, spam.
Email: "When is my refund coming through?"
Category: billing
Email: "Demo of your enterprise plan?"
Category: sales
Email: "URGENT: claim your prize now."
Category: spam
Email: "{{email_body}}"
Category:
```

The model doesn't need a definition of "billing" or "spam." The examples define them. And by ending mid-pattern (after the last Category:), you've given the model exactly one thing to do: complete the pattern.
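If you assemble this prompt in code rather than by hand, a small helper keeps every example byte-identical. A minimal Python sketch (`EXAMPLES` and `build_prompt` are illustrative names, not any library's API):

```python
# A minimal sketch of assembling the classifier prompt above in Python.
# EXAMPLES and build_prompt are illustrative names, not any library's API.

EXAMPLES = [
    ("When is my refund coming through?", "billing"),
    ("Demo of your enterprise plan?", "sales"),
    ("URGENT: claim your prize now.", "spam"),
]

def build_prompt(email_body: str) -> str:
    header = (
        "Classify each email into exactly one category:\n"
        "support, billing, sales, spam.\n\n"
    )
    # Render every example with byte-identical structure (see Rule 4 below).
    shots = "\n\n".join(
        f'Email: "{email}"\nCategory: {category}'
        for email, category in EXAMPLES
    )
    # End mid-pattern so the model's only move is to complete "Category:".
    return f'{header}{shots}\n\nEmail: "{email_body}"\nCategory:'

print(build_prompt("Why was I charged twice this month?"))
```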
Why few-shot is so effective
Two mechanisms compound:
- Format anchoring. The model learns the exact output structure from the examples — punctuation, capitalization, length, key names, all of it. Things you'd need a paragraph to describe.
- Edge-case demonstration. A well-chosen tricky example teaches the model how to handle that case better than any instruction can. "If the email is promotional, classify as spam" is fuzzy. A spam example that contains the word URGENT is concrete.
Four rules for picking examples
Rule 1: cover the diversity of inputs
Your examples should span the kinds of inputs the model will see in production. For a classifier, include at least one example per category. For a generator, include examples of different lengths, tones, or topics — whatever varies in real usage.
Rule 2: include the tricky case on purpose
The example that costs you the most to write is the one that helps the most. Pick the input you'd most fear getting wrong, and put a perfect handling of it in the prompt. Edge-case demonstrations carry more weight than easy ones.
Rule 3: balance the distribution
If 4 of your 5 examples classify as "positive," the model will lean positive on ambiguous inputs. This is real and measurable. If your real distribution is skewed, balance the examples anyway — you can correct downstream.
Rule 4: identical format across every example
If example 1 says Category: billing and example 2 says Category - billing, you've confused the model. Use exactly the same separators, casing, and key names across all examples, and in the real query. Violating this rule causes most subtle few-shot bugs.
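You can enforce this mechanically by checking every example against one regex that encodes the canonical format. A hedged sketch, assuming the Email/Category layout of the classifier above (adapt the pattern to your own template):

```python
import re

# One regex that encodes the canonical example format. The pattern and
# examples below assume the Email/Category classifier above.
EXAMPLE_BLOCK = re.compile(r'^Email: "[^"]+"\nCategory: (support|billing|sales|spam)$')

examples = [
    'Email: "When is my refund coming through?"\nCategory: billing',
    'Email: "Demo of your enterprise plan?"\nCategory - sales',  # wrong separator
]

for i, block in enumerate(examples):
    if not EXAMPLE_BLOCK.match(block):
        print(f"Example {i} deviates from the canonical format: {block!r}")
```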
How many examples? Two questions
Question 1: How varied are your inputs? If the task has 5 distinct flavors, you need at least 5 examples to demonstrate them. If it has 1, you can get away with 2.
Question 2: How tight do you need the format? The stricter the format requirements (specific JSON keys, punctuation, length), the more examples earn their keep.
How many examples to include
| If your situation is… | Reach for… | Why |
|---|---|---|
| Common task, format the model already knows | 0 (zero-shot) | Translation, sentiment, summarization rarely need examples |
| Moderately custom format, low input variance | 2–3 | Enough to anchor format; cheap on tokens |
| Most production tasks | 3–5 | The sweet spot — covers variance, locks format, keeps prompt small |
| High-stakes outputs, strict format, varied inputs | 5–8 | Each example earns its tokens via reduced regression risk |
| More than 10 examples | Refactor | Diminishing returns; consider chaining or fine-tuning instead |
The recency effect
Models weight the examples closest to the real input most heavily, so order matters. Put your strongest or most representative example last, right before the real query.
A real-world worked example
```
Convert raw standup notes into a structured update.
Output exactly this format:
**Yesterday:** <one line>
**Today:** <one line>
**Blockers:** <one line, or "none">
---
Notes: "shipped the auth refactor; pairing with Sam on rate limit; nothing blocking"
**Yesterday:** Shipped the auth refactor.
**Today:** Pairing with Sam on rate limit.
**Blockers:** none
---
Notes: "stuck on flaky test in checkout, will keep trying, working on PR review queue"
**Yesterday:** Working through the PR review queue.
**Today:** Continue debugging the flaky checkout test.
**Blockers:** Flaky test in checkout suite.
---
Notes: "cant deploy because staging is down, draft RFC for new caching layer"
**Yesterday:** Drafted the RFC for the new caching layer.
**Today:** Same RFC; gathering review.
**Blockers:** Staging is down — can't deploy.
---
Notes: "{{raw_notes}}"Notice how much the examples teach: the order of fields, the wording of none, the verb-led phrasing, the handling of multi-clause notes, the direction of past/future tense. None of that is in the instruction.
When few-shot is the wrong choice
- The task is genuinely simple. "Translate to Spanish" needs no examples — every model knows what Spanish looks like.
- Your examples conflict with each other. Inconsistent "good" outputs produce inconsistent model outputs. Fix the spec first, then write examples.
- Token budget is at a premium. Each example costs tokens. At very high volumes, even small per-call costs matter; measure whether the quality lift justifies the cost (see the back-of-envelope sketch after this list).
- You're using a reasoning model. o1, o3, Claude with extended thinking, Gemini thinking mode — few-shot examples sometimes hurt these. Concise zero-shot prompts often win.
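The back-of-envelope math, with every number made up (substitute your own measured token counts and your provider's pricing):

```python
# Back-of-envelope few-shot cost. Every number here is hypothetical;
# measure tokens_per_example with your tokenizer and check real pricing.
tokens_per_example = 60
num_examples = 5
calls_per_month = 2_000_000
price_per_million_input_tokens = 1.00  # hypothetical rate in USD

extra_tokens = tokens_per_example * num_examples * calls_per_month
extra_cost = extra_tokens / 1_000_000 * price_per_million_input_tokens
print(f"{extra_tokens:,} extra input tokens -> ${extra_cost:,.2f}/month")
```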
Going further: advanced few-shot patterns
Few-shot Chain-of-Thought
Combine few-shot with Chain-of-Thought by including reasoning steps in each example, not just the input/output pair. The model learns both the answer pattern and the reasoning shape. Strongest for classification with rationale or any task where you want explanations.
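A sketch of the shape, using an invented review-classification task (the labels and wording are illustrative):

```
Classify the review as positive, negative, or mixed.
Reason briefly before answering.

Review: "Battery life is great, but the screen scratches easily."
Reasoning: One clear positive (battery) and one clear negative (screen).
Category: mixed

Review: "Does exactly what it says. Would buy again."
Reasoning: Two positive statements, no complaints.
Category: positive

Review: "{{review}}"
Reasoning:
```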
Dynamic example selection
Instead of fixed examples, retrieve the most relevant examples for each new input from a larger pool. Same infrastructure as RAG — embed examples, embed the query, pick the top K most similar. Useful when no fixed set covers the input space.
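A minimal sketch in Python, using scikit-learn's TF-IDF as a stand-in for a real embedding model (the pool contents are illustrative; in production you would embed with the same dense model you use for retrieval):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pool of (input, output) example pairs.
POOL = [
    ("When is my refund coming through?", "billing"),
    ("Demo of your enterprise plan?", "sales"),
    ("URGENT: claim your prize now.", "spam"),
    ("The app crashes when I open settings.", "support"),
    ("Can I get an invoice for March?", "billing"),
]

def select_examples(query: str, k: int = 3):
    # TF-IDF stands in for a dense embedding model to keep this runnable.
    texts = [inp for inp, _ in POOL]
    vectorizer = TfidfVectorizer().fit(texts + [query])
    example_vecs = vectorizer.transform(texts)
    query_vec = vectorizer.transform([query])
    # Rank the pool by similarity to the query and keep the top k.
    scores = cosine_similarity(query_vec, example_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [POOL[i] for i in top]

print(select_examples("Why was I charged twice this month?"))
```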
Stratified examples
For multi-class problems, ensure your example set covers each class with at least one demonstration. For ranking problems, include examples spanning the full score range. The diversity rule above, applied with discipline.
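In code, stratification is a one-liner per class. A sketch with a hypothetical example pool:

```python
import random
from collections import defaultdict

# Hypothetical pool of labeled examples.
pool = [
    ("When is my refund coming through?", "billing"),
    ("Can I get an invoice for March?", "billing"),
    ("Demo of your enterprise plan?", "sales"),
    ("URGENT: claim your prize now.", "spam"),
    ("The app crashes when I open settings.", "support"),
]

# Group by class so every class is guaranteed a demonstration.
by_class = defaultdict(list)
for item in pool:
    by_class[item[1]].append(item)

# One randomly chosen example per class; add extra picks if your
# example budget (see the table above) allows more shots.
examples = [random.choice(group) for group in by_class.values()]
print(examples)
```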
Examples in the system prompt
On Claude and GPT-4o, putting examples in the system prompt instead of the user message slightly improves instruction-following. The model treats system-prompt examples as more authoritative. On Claude, wrap each example in <example> tags.
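A sketch of the structure (the system/user split is the common message shape; the exact client call varies by provider SDK, so treat the final comment as one example, not the only way):

```python
# Build a system prompt that carries the examples, wrapped in <example>
# tags as suggested above for Claude. EXAMPLES is illustrative.
EXAMPLES = [
    ('When is my refund coming through?', 'billing'),
    ('Demo of your enterprise plan?', 'sales'),
]

system_prompt = (
    "Classify each email into exactly one category: "
    "support, billing, sales, spam.\n\n"
    + "\n".join(
        f'<example>\nEmail: "{email}"\nCategory: {category}\n</example>'
        for email, category in EXAMPLES
    )
)

# The real query goes in the user message, ending mid-pattern as before.
messages = [{"role": "user", "content": 'Email: "{{email_body}}"\nCategory:'}]
# e.g. with the Anthropic SDK: client.messages.create(model=..., max_tokens=...,
#     system=system_prompt, messages=messages)
```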
Common mistakes
- Inconsistent format across examples. The single biggest source of few-shot drift. Audit every example for byte-identical structure.
- All examples handle the easy case. The model performs brilliantly on easy inputs and catastrophically on hard ones. Include at least one hard example.
- Examples that contradict the instruction. If the instruction says "1-line output" but examples are 3 lines, the examples win. The model follows demonstrations over directives.
- Hardcoding examples that age out. When your product, brand voice, or rules change, the examples in your prompts go stale. Version them — see version control for prompts.
- Not extracting the structure into a template. Few-shot prompts get reused more than any other kind. Pull the variable parts out — see prompt variables.
Quick reference
The 60-second summary
What it is: include 2–5 input/output examples in the prompt before the real query. The model imitates the pattern.
The four rules: diverse, include the tricky case, balanced distribution, identical format every time.
The sweet spot: 3–5 examples for most tasks. 5–8 for strict format with high input variance. More than 10 = refactor.
When to skip: simple tasks, reasoning models, token-budget-critical workloads.
What to read next
Few-shot is the foundation. To layer reasoning on top, read Chain-of-Thought. When examples vary by user or context, see Prompt variables for how to keep templates clean. For the original paper (Brown et al., 2020, the GPT-3 paper that introduced few-shot prompting at scale), see papers.