
Zero-Shot vs. Few-Shot in Production: When Examples Help and When They Hurt

· 10 min read
Tian Pan
Software Engineer

The most common advice about few-shot prompting is: add examples, watch quality go up. That advice is wrong often enough that you shouldn't trust it without measuring. In practice, the relationship between examples and performance is non-monotonic — it peaks somewhere and then drops. Sometimes it drops a lot.

A 2025 empirical study tracked 12 LLMs across multiple tasks and found that Gemma 7B fell from 77.9% to 39.9% accuracy on a vulnerability identification task as examples were added beyond the optimal count. LLaMA-2 70B dropped from 68.6% to 21.0% on the same type of task. In code translation benchmarks, functional correctness typically peaks somewhere between 5 and 25 examples and degrades from there. This isn't a quirk of specific models — it's a pattern researchers have named "few-shot collapse," and it shows up broadly.

If you're shipping an LLM feature and making the default few-shot decision based on vibes rather than measurement, you may already be on the wrong side of this curve.

Why Examples Hurt: The Mechanisms

Understanding when few-shot prompting fails requires understanding what the model is actually doing with examples. The intuitive model — that examples teach the model new input-output mappings it didn't know — is mostly wrong. A well-known finding is that randomly replacing the correct labels in few-shot examples barely hurts performance. The model wasn't using the label information the way you thought. It was picking up distribution signals: what format the output should take, what length is appropriate, what vocabulary register you're in.

This means examples can anchor the model to the wrong distribution. If your three examples all use formal language, the model will use formal language even when a casual tone would serve the user better. If your examples happen to omit dollar signs on numbers, the model will omit them throughout — including in places where you need them. If all your examples are from one class in a classification task, the majority label bias will kick in and the model will over-predict that label.

There are at least three distinct failure modes:

Recency bias. Models assign disproportionate weight to examples appearing near the end of a prompt. In classification, swapping the order of examples has been shown to shift accuracy by more than 10 percentage points on some benchmarks — same examples, same task, different ordering.

Majority label bias. When one class appears more frequently in your examples, the model systematically over-predicts it. This is easy to overlook in binary classification where your examples happen to skew 3:1.

Distribution anchor bias. When examples share an incidental characteristic — formatting style, domain vocabulary, sentence length — the model assumes that characteristic is normative and applies it to all inputs, even ones where it doesn't fit.

The deeper issue is that examples become a constraint, not just a hint. As more examples are added, the model increasingly optimizes for pattern-matching to those examples rather than generalizing from the underlying task description. This is the mechanism behind few-shot collapse.

When Zero-Shot Is the Right Default

Zero-shot prompting is underrated in production. It has concrete advantages that get overlooked when teams assume "more is better."

Lower token overhead. Each example adds 50-200 tokens depending on length. At high request volumes, that cost compounds. If you're running 10 million requests per day and your examples add 200 tokens each, that's two billion extra input tokens per day for something that may not help.
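Making that arithmetic concrete: a back-of-envelope sketch, where the per-token price is a hypothetical placeholder rather than any specific provider's rate.

```python
# Rough cost of static few-shot examples at volume. The price below
# is a hypothetical placeholder, not a quote for any real provider.
requests_per_day = 10_000_000
extra_tokens_per_request = 200          # tokens added by the examples
usd_per_million_input_tokens = 0.50     # assumed price

extra_tokens_per_day = requests_per_day * extra_tokens_per_request
extra_cost_per_day = extra_tokens_per_day / 1_000_000 * usd_per_million_input_tokens

print(f"{extra_tokens_per_day:,} extra tokens/day")  # 2,000,000,000
print(f"${extra_cost_per_day:,.0f}/day")             # $1,000/day
```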

Better cross-model stability. Zero-shot performance tends to be less sensitive to model version changes. When you switch from one model to another — or when your model provider silently updates weights — zero-shot prompts are less likely to regress. Few-shot prompts that worked with one model can fail unexpectedly with another because the new model has different sensitivities to example distribution.

Simpler maintenance. When your task domain shifts, zero-shot prompts require updating the instruction. Few-shot prompts require updating both the instruction and the example set. The example set often lags, which means your prompt is implicitly teaching the model about the old distribution while the instruction asks it to handle the new one.

Zero-shot is the right starting point for tasks that are common in pretraining data: general sentiment analysis, Q&A over well-structured documents, simple classification of standard categories, summarization of common document types. These tasks don't need examples because the model has seen thousands of analogous examples during training. Adding more just constrains the output distribution toward your specific examples rather than the broader learned distribution.

When Static Few-Shot Makes Sense

Static few-shot — picking 2-5 examples and keeping them fixed across all requests — works well in a specific window of conditions.

The task should be genuinely unusual. If you're doing something domain-specific (classifying regulatory filings by type, tagging medical notes using a custom taxonomy, extracting structured fields from an unusual document format), examples help because the model hasn't seen many analogous tasks during pretraining. The examples calibrate it to your specific schema.

The examples should be high-quality and representative of your actual input distribution. Synthetic examples that you wrote in 20 minutes are almost always worse than real examples drawn from your production data. The gap is large enough to matter.

The task complexity should be moderate. Simple tasks don't benefit much from examples. Very complex tasks overwhelm a static 3-example set — the examples can't cover the variance.

And critically: you should have measured that examples help. The right protocol is to establish a zero-shot baseline on a representative validation set, then test 1, 2, 3, and 4 examples, and look at where performance peaks versus where it starts to plateau or decline. Most teams skip this and go straight to "we'll put in 4 examples" based on intuition.
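The protocol is a simple sweep over shot counts. A minimal sketch — `build_prompt`, the example pool, and the validation set are hypothetical stand-ins, and `call_model` is stubbed so the loop runs end to end; swap in your actual model client and data.

```python
# Shot-count sweep: measure zero-shot first, then 1..4 examples, and
# look for the peak. `call_model` is a stub standing in for a real
# LLM call; the examples and validation set are placeholders.
EXAMPLES = [("input A", "label A"), ("input B", "label B"),
            ("input C", "label C"), ("input D", "label D")]
VALIDATION = [("real input 1", "label A"), ("real input 2", "label B")]

def build_prompt(instruction, examples, query):
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n{shots}\nInput: {query}\nOutput:"

def call_model(prompt):
    # Stub: replace with your actual model call.
    return "label A"

def accuracy(n_shots):
    correct = 0
    for query, gold in VALIDATION:
        prompt = build_prompt("Classify the input.", EXAMPLES[:n_shots], query)
        correct += call_model(prompt) == gold
    return correct / len(VALIDATION)

# n = 0 is the zero-shot baseline; watch where the curve peaks or declines.
curve = {n: accuracy(n) for n in range(5)}
```

The point of keeping `n = 0` in the sweep is that it forces the zero-shot baseline to exist before any few-shot number gets reported.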

Dynamic Example Retrieval: The Production-Grade Approach

When static few-shot is insufficient, the next step is dynamic example selection: for each incoming request, retrieve the most relevant examples from a pool rather than using the same fixed set.

The core idea is that different inputs benefit from different examples. A user query about contract law should draw legal-domain examples; a query about medical coding should draw clinical-domain examples. Using the same three generic examples for both is suboptimal. Semantic similarity-based retrieval — embedding both the incoming query and your example pool, then selecting the nearest neighbors — consistently outperforms fixed example sets.

In biomedical NER benchmarks, TF-IDF-based dynamic selection improved F1 scores by 7.3% compared to static few-shot in 5-shot settings. SBERT (sentence-BERT) embeddings perform better still for tasks with complex semantic structure. Graph-based retrieval approaches index your example pool as a structured graph and retrieve relevant subgraphs per query, which works well when your examples have relationships you want to preserve.

The architecture is straightforward:

  1. Curate a pool of high-quality labeled examples (50-500, depending on domain coverage you need).
  2. Embed each example with a fixed embedding model.
  3. At inference time, embed the incoming request, retrieve the top-k examples by cosine similarity, and insert them into the prompt.
  4. Track which example sets correlate with high-quality outputs so you can improve the pool over time.
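The steps above can be sketched as follows. The `embed` function here is a deliberately crude character-hashing stand-in so the example is self-contained; in production you would use a real sentence-embedding model, and the example pool shown is hypothetical.

```python
import numpy as np

# Toy version of the retrieval pipeline above. `embed` is a stand-in
# hash-based embedding, NOT a real model; swap in real embeddings.
def embed(text, dim=64):
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) * (i + 1)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

EXAMPLE_POOL = [
    ("What does clause 7 of the contract mean?", "legal answer ..."),
    ("Map this diagnosis to an ICD-10 code.", "clinical answer ..."),
    ("Summarize this earnings report.", "finance answer ..."),
]
# Step 2: embed the pool once, up front.
POOL_VECS = np.stack([embed(q) for q, _ in EXAMPLE_POOL])

def retrieve(query, k=2):
    # Step 3: embed the request, take top-k by cosine similarity.
    sims = POOL_VECS @ embed(query)  # unit-norm vectors, so dot = cosine
    top = np.argsort(sims)[::-1][:k]
    return [EXAMPLE_POOL[i] for i in top]

shots = retrieve("Is this contract clause enforceable?")
```

Step 4 — logging which retrieved sets correlate with good outputs — lives in your observability stack rather than this hot path.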

The important caveat: dynamic retrieval shifts the quality ceiling from "how good are your 3 fixed examples" to "how good is your example retrieval." If your example pool is sparse or your embeddings don't capture the relevant semantic distinctions, dynamic retrieval underperforms static. You've added infrastructure complexity for a regression. Validate retrieval quality separately before attributing any performance gains to the selection mechanism.

Calibrating Away Bias

Even with well-selected examples, few-shot prompting introduces calibration problems that hurt reliability. Contextual calibration is a practical mitigation.

The approach: run your prompt against a null input ("N/A" or an empty query), observe the output distribution, and measure the model's baseline bias. If it predicts class A 70% of the time on null input, it has a strong prior toward A that will distort its outputs on real inputs. You can use this bias estimate to adjust prediction scores at inference time.

Batch calibration extends this further: gather a representative batch of inputs, compute the aggregate bias across the batch, and apply corrections. This approach requires essentially zero additional inference compute — you compute corrections once per batch, then apply a simple linear transformation to logits or prediction scores.
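A minimal sketch of the diagonal-correction idea behind contextual calibration: estimate class probabilities on a content-free input, then divide that prior out of real predictions and renormalize. The probability vectors below are illustrative numbers, not measurements from any model.

```python
import numpy as np

# Contextual calibration sketch: measure the model's bias on a
# content-free input ("N/A"), then divide it out of real predictions.
# Both probability vectors are made-up numbers for illustration.
p_null = np.array([0.70, 0.30])  # class probs on the null input
p_real = np.array([0.55, 0.45])  # class probs on a real input

# Diagonal correction: rescale by 1 / p_null, then renormalize.
corrected = p_real / p_null
corrected /= corrected.sum()

print(corrected)  # the raw prediction favored class 0; corrected flips it
```

Batch calibration works the same way, except `p_null` is replaced by the mean prediction over a representative batch of real inputs, which is why it costs essentially nothing extra at inference time.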

These calibration techniques don't eliminate the problem of bad example selection, but they make the model more robust to the biases that static examples introduce.

The Decision Framework in Practice

The decision isn't a one-time architectural choice. It's a measurement protocol.

Start with zero-shot. Build your evaluation set first — a representative sample of real inputs with ground-truth labels or quality assessments. Measure zero-shot performance and lock in that baseline. This is non-negotiable. Teams that skip this end up reasoning about whether few-shot "feels like it's working" rather than whether it's actually working.

Test static few-shot incrementally. Add 1 example, measure. Add 2, measure. Add 3, measure. Watch for the curve. Most tasks have a sweet spot between 1 and 4 examples. If you're still seeing gains at 4, you're probably in territory where dynamic retrieval is worth exploring.

Test on out-of-distribution inputs. Your examples were drawn from some slice of your input space. Make sure your few-shot variant doesn't degrade on inputs outside that slice. Position and majority label bias tend to surface most clearly on OOD inputs.

If you need dynamic retrieval, measure retrieval quality independently. Does a similarity-based example set actually contain semantically relevant examples? Are they formatted consistently? Do they represent the actual task schema? Debugging end-to-end without decomposing retrieval quality from generation quality is slow and confusing.

Monitor in production. Static example sets go stale as your user base and task distribution evolve. If your users start asking questions your examples don't cover, few-shot performance will degrade silently — the model will fall back to pattern-matching on incidentally similar examples rather than the semantically correct ones. Tracking output quality over time, separately from user engagement metrics, is the only way to catch this before it becomes a complaint.

Where This Leaves You

The zero-shot vs. few-shot decision is an empirical question, not a rule. The production-safe approach is: zero-shot first, static few-shot only after measuring that it helps, dynamic retrieval when the task diversity exceeds what static examples can cover.

The failure mode to avoid is prompt engineering by accumulation — steadily adding examples over weeks because each addition seemed to help in informal testing, until you end up with a fragile prompt that works well on the examples you wrote and poorly on the tail of your real distribution. Measure from the start, keep a validation set that represents real users, and don't ship until you've checked both the peak and the out-of-distribution cases.

Few-shot prompting is a tool, not a default. Use it when it helps. Measure whether it helps.
