Few-shot examples: when they help, how to write them
When examples are the single biggest unlock you have, and when they hurt. The rules: relevant, diverse, structured. How many is enough. Why one bad example can corrupt the whole output.
If you can only learn one prompting technique, learn this one. Showing the model what good output looks like, with 1–5 actual examples, usually improves results more than any other single change to a prompt.
It's called few-shot prompting (one example = "one-shot," several = "few-shot," zero = "zero-shot," which is what you do by default).
When few-shot is the single biggest unlock
The technique earns its keep when:
- The output format is non-obvious. "Extract all the dates" — easy. "Extract all the dates as ISO strings, but only if they're in the past, and group them by quarter" — show an example.
- The tone is specific. Marketing copy in a specific brand voice, code review comments in a specific style, customer support responses in a specific register. Examples carry tone better than descriptions.
- You want consistency across runs. Few-shot is the most reliable way to get repeatable output shape across many runs of the same prompt.
- The task involves judgment. Classifying customer feedback into categories, deciding whether a PR is risky enough to flag, grading content quality. The model needs to see what each category looks like.
When few-shot doesn't help (or hurts)
- Creative tasks where you want variety. Examples bias the output toward looking like the examples. If you want range, use fewer examples or none.
- Tasks the model already does well by default. Asking for a Python function to sort a list doesn't need an example.
- When you don't have good examples. A bad example is worse than no example. The model will faithfully match the wrong pattern.
The three rules for good examples
Rule 1: Relevant
Examples should look like your actual use case. If you're processing real customer feedback, your example should look like real customer feedback — not synthetic, not generic, not "Customer A said X."
Bad example for a sentiment classification task:
<example>
Input: "This is great!"
Output: positive
</example>
Better:
<example>
Input: "ngl was skeptical at first but the migration went way smoother than i expected. one minor thing on the docs - the env var naming is inconsistent w/ what's in the github readme. fixable. 8/10"
Output: positive (with a specific feature complaint to flag — docs inconsistency)
</example>
The second example shows the model the messy reality it'll actually encounter: casual language, abbreviations, mixed sentiment, embedded specific feedback. The first example will produce a classifier that handles "this is great" perfectly and fails on everything else.
Rule 2: Diverse
If all your examples have the same structure, the model assumes that structure is part of the spec. Cover edge cases. Include examples that show what to do with unusual inputs.
For a 5-example set on customer-feedback classification, include:
- A clearly positive one
- A clearly negative one
- A mixed-sentiment one
- An ambiguous one
- An off-topic one (where the model should refuse or flag)
If all 5 are unambiguously positive, the model will struggle the first time it sees ambiguity.
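The checklist above can be verified mechanically before the examples go into a prompt. The sketch below assumes each example carries a hand-assigned `kind` tag; that field, the tag names, and the `coverage_report` helper are illustrative and not part of any library or of the prompt itself.

```python
from collections import Counter

# Hypothetical coverage tags mirroring the five-item checklist.
REQUIRED_KINDS = {"positive", "negative", "mixed", "ambiguous", "off_topic"}


def coverage_report(examples):
    """Return (missing_kinds, counts) for a list of dicts with a 'kind' key."""
    counts = Counter(ex["kind"] for ex in examples)
    missing = REQUIRED_KINDS - set(counts)
    return missing, counts


# A lopsided set: three positives, no ambiguous or off-topic case.
draft_set = [
    {"kind": "positive", "input": "migration went way smoother than i expected"},
    {"kind": "positive", "input": "love the new dashboard"},
    {"kind": "positive", "input": "support replied in 5 min, solved"},
    {"kind": "negative", "input": "billing page has been down since monday"},
    {"kind": "mixed", "input": "fast, but the docs are out of date"},
]

missing, counts = coverage_report(draft_set)
print("missing kinds:", sorted(missing))
print("most common kind:", counts.most_common(1)[0])
```

A non-empty `missing` set is a prompt to go find a real input of that kind, not to invent a synthetic one.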
Rule 3: Structured
Wrap each example in <example> tags. Wrap the group in <examples>. This makes it unambiguous to the model that these are patterns to learn from — not instructions to follow literally.
<examples>
<example>
<input>
{a realistic input}
</input>
<output>
{the correct output for that input}
</output>
</example>
<example>
<input>
{a different realistic input — different structure, different edge case}
</input>
<output>
{the correct output}
</output>
</example>
</examples>
The model treats this as "here are some demonstrations of the task." Without the tags, it might treat the examples as part of the input or part of the instructions.
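If the examples live in code rather than in a hand-written prompt, the same structure can be generated. This is a minimal sketch: `render_examples` is a hypothetical helper name, and the guard against tag collisions is a design assumption rather than something a model requires.

```python
# Tags used as delimiters; example text must not contain them verbatim.
DELIMITERS = (
    "<examples>", "</examples>",
    "<example>", "</example>",
    "<input>", "</input>",
    "<output>", "</output>",
)


def render_examples(pairs):
    """Render (input_text, output_text) pairs as one <examples> block."""
    lines = ["<examples>"]
    for input_text, output_text in pairs:
        for text in (input_text, output_text):
            if any(tag in text for tag in DELIMITERS):
                raise ValueError("example text contains a delimiter tag")
        lines += [
            "<example>",
            "<input>", input_text.strip(), "</input>",
            "<output>", output_text.strip(), "</output>",
            "</example>",
        ]
    lines.append("</examples>")
    return "\n".join(lines)


block = render_examples([
    ("ngl was skeptical at first but the migration went fine. 8/10", "positive"),
    ("billing page has been down since monday", "negative"),
])
print(block)
```

Rejecting inputs that already contain a delimiter keeps an example from closing its own tag early and blurring where one demonstration ends and the next begins.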
How many examples is enough
A common rule of thumb is that 3–5 examples is the sweet spot. After that, returns diminish quickly. Past about 10, extra examples can actively hurt output quality, because the model starts copying surface patterns in the examples instead of learning the task.
If you have 1 example, use 1. If you have 20, pick the 5 most diverse ones.
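One way to pick "the most diverse" examples from a larger pool is greedy farthest-point selection: start with one example, then repeatedly add the candidate least similar to anything already chosen. The sketch below uses word-overlap (Jaccard) distance as a crude stand-in for diversity; both the distance measure and the `pick_diverse` helper are assumptions made for illustration, and an embedding-based distance would be a reasonable substitute.

```python
def tokens(text):
    """Lowercased whitespace-separated words as a set."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over word sets; 0.0 when both are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def pick_diverse(candidates, k=5):
    """Greedily pick up to k candidates, each as far as possible from those already picked."""
    if not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        # Score each candidate by its distance to the nearest chosen example.
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


pool = [
    "the app is great",
    "the app is great!",
    "refund took three weeks",
    "the app is great really",
]
print(pick_diverse(pool, k=2))
```

Word overlap only measures surface variety. It will not know that two differently worded inputs exercise the same edge case, so treat the result as a shortlist to review by hand.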
Why one bad example can ruin everything
The model is a faithful pattern-matcher. If your examples all happen to:
- Start outputs with "Here is your..."
- End outputs with a question
- Use 3-bullet lists exactly
- Avoid certain words by coincidence
...it will assume that's part of the task. It will reproduce those patterns even when they don't make sense for a particular input.
Inspect your examples for accidental patterns. If 3 of your 5 examples end with a question mark, the model is likely to treat a closing question as part of the task. Either fix the pattern or break it deliberately in one example.
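Some of these accidental patterns are easy to catch with a script. The sketch below checks three of the surface features listed above and flags any that show up in at least 60% of the example outputs. The check names, the threshold, and the `accidental_patterns` helper are all illustrative assumptions; extend the table with whatever quirks matter for your task.

```python
def count_bullets(text):
    """Number of lines that start with '- ' after leading whitespace."""
    return sum(line.lstrip().startswith("- ") for line in text.splitlines())


# Hypothetical surface checks; each maps a label to a predicate on one output.
CHECKS = {
    "ends with a question": lambda o: o.rstrip().endswith("?"),
    "starts with 'Here is'": lambda o: o.lstrip().lower().startswith("here is"),
    "has exactly 3 bullets": lambda o: count_bullets(o) == 3,
}


def accidental_patterns(outputs, threshold=0.6):
    """Return {label: share} for checks matched by at least `threshold` of outputs."""
    if not outputs:
        return {}
    flagged = {}
    for label, check in CHECKS.items():
        share = sum(check(o) for o in outputs) / len(outputs)
        if share >= threshold:
            flagged[label] = share
    return flagged


example_outputs = [
    "Here is your summary. Want more detail?",
    "Done. Anything else?",
    "Refund issued.",
    "Here is your invoice. Need a copy?",
    "Ticket closed.",
]
print(accidental_patterns(example_outputs))
```

A script like this only finds patterns you thought to look for, so it complements a manual read-through rather than replacing it.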
The meta-trick: ask the model to evaluate your examples
Once you have a few examples, paste them to the model and ask:
Here are the examples I'm planning to use for a few-shot prompt:
<examples>
{your examples}
</examples>
Critique them. Specifically:
- Are they diverse enough to cover the actual range of inputs I'll see?
- Are there any accidental patterns across examples that would mislead a model?
- Is there an obvious edge case missing?
If you'd add a 6th example, what would it look like and why?
This turns the model into a prompt critic. It catches patterns you missed and surfaces gaps in coverage you'd otherwise discover in production.
A worked example: extracting structured data from emails
Goal: extract meeting details from scheduling emails.
Without few-shot, you'd write a long description of the format. With few-shot, you show:
<examples>
<example>
<input>
"Hey, can we move our 1:1 from Tuesday to Thursday at the same time? Thursday at 3pm PT works."
</input>
<output>
{
  "action": "reschedule",
  "from": "Tuesday 3pm PT",
  "to": "Thursday 3pm PT",
  "type": "1:1",
  "location": null,
  "recurrence": null,
  "duration_min": null,
  "confirmed": false
}
</output>
</example>
<example>
<input>
"Confirming our coffee chat tomorrow 9am at Verve. See you then!"
</input>
<output>
{
  "action": "confirm",
  "from": "tomorrow 9am",
  "to": "tomorrow 9am",
  "type": "coffee chat",
  "location": "Verve",
  "recurrence": null,
  "duration_min": null,
  "confirmed": true
}
</output>
</example>
<example>
<input>
"Hi all - moving the team retro to a recurring monthly cadence. Next one: Tuesday Dec 10 at 2pm ET, 60 min."
</input>
<output>
{
  "action": "create_recurring",
  "from": null,
  "to": "Tuesday Dec 10 2pm ET",
  "type": "team retro",
  "location": null,
  "recurrence": "monthly",
  "duration_min": 60,
  "confirmed": true
}
</output>
</example>
</examples>
Now extract details from the following email. Use the same JSON shape. If a field doesn't apply, use null. Don't invent fields.
<email>
{the actual email to parse}
</email>
Three examples teach the model how to handle reschedules, confirmations, and new recurring meetings. They show what to do with missing info. They establish the JSON shape. They demonstrate the "use null, don't invent" rule.
Try writing that prompt without examples. It will likely be several times as long and produce less consistent output.
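Even with good examples, it is worth checking that each response sticks to the JSON shape the examples establish. The sketch below parses a response and reports missing or invented keys. The key sets are read off the three examples above; the split into required and optional keys, and the `check_extraction` helper itself, are assumptions made for illustration.

```python
import json

# Keys that appear in every example output.
REQUIRED_KEYS = {"action", "from", "to", "type", "confirmed"}
# Keys that appear in at least one example output (assumed optional).
OPTIONAL_KEYS = {"location", "recurrence", "duration_min"}


def check_extraction(raw):
    """Parse a model response; return a list of problems (empty if none found)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    keys = set(data)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    problems += [
        f"invented key: {k}"
        for k in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS)
    ]
    return problems


response = '{"action": "confirm", "attendees": 3}'
for problem in check_extraction(response):
    print(problem)
```

A response that fails this check is a useful signal in its own right: if the model keeps inventing the same field, the examples are probably missing a case where that information appears.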
What to read next
- What makes a prompt actually work — the foundations
- Structuring prompts with XML, roles, and sections — the formatting framework that makes examples work