Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.

The Failure Mode the New Model Doesn't Have

Few-shot examples are usually chosen against a specific model's specific weaknesses. The senior engineer notices the model keeps emitting trailing commas in JSON output. They add an example showing the correct shape. They notice it confuses dates in DD/MM versus MM/DD when the field name is ambiguous. They add another example. By the time the prompt ships, those examples are a compressed history of the failure modes the team caught during evaluation.

When the model upgrades, those failure modes shift. The new generation has different formatting priors, different instruction-following behavior, and different blind spots. The trailing-comma example that fixed a real bug last quarter is now correcting a failure mode the new model doesn't have — and the model's exposure to that example is not free. It costs context tokens, it introduces a stylistic bias the model doesn't need, and worst of all, it can over-correct: a model that already gets JSON right can be nudged into eccentric formatting choices because the demonstrations imply the task is about formatting rather than the actual semantics.

There is a documented version of this called few-shot collapse: as you add examples, performance climbs to a peak and then falls. On some models the fall is steep: Gemma 7B has been measured dropping from 77.9% accuracy to 39.9% as the example count grows past its sweet spot. The mechanisms are well understood: longer context degrades attention to the actual task, examples stuck in the middle of a long prompt get lost, and demonstrations that contradict the model's pre-training priors create patterns the model over-indexes on. None of those mechanisms care that your examples were good last quarter.
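One way to see where that peak sits for a given model is to sweep the example count and score each prefix of the few-shot block. The sketch below is illustrative only: `score_prompt` is a hypothetical callable that you would supply, assumed to run your eval set with the given examples and return a single accuracy number. It also assumes the examples are added in their existing prompt order, which is only one of many possible orderings.

```python
from typing import Callable, Sequence


def sweep_shot_count(
    examples: Sequence[str],
    score_prompt: Callable[[Sequence[str]], float],  # hypothetical eval hook
) -> tuple[int, list[float]]:
    """Score the prompt with the first k examples, for k = 0..len(examples).

    Returns the k with the highest score (smallest k on ties) and the
    full list of scores so the rise-then-fall shape can be inspected.
    """
    scores = [score_prompt(examples[:k]) for k in range(len(examples) + 1)]
    best_k = max(range(len(scores)), key=scores.__getitem__)
    return best_k, scores
```

If `best_k` comes out well below the number of examples you currently ship, that is a hint (not proof) that the block has grown past the new model's sweet spot.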

OpenAI has been telling people this directly. The official guidance for newer reasoning models is that "clear instructions and well-defined constraints often work better than adding examples," and that few-shot prompts can reduce performance when the task requires heavy reasoning. The decoder community has summarized OpenAI's migration advice as: treat the new model "as a new model family to tune for, not a drop-in replacement," and "begin migration with a fresh baseline instead of carrying over every instruction from an older prompt stack." Few-shot examples are the heaviest part of that older prompt stack. They are also the part teams are least willing to delete.

The Per-Example Utility Audit

The hard fact is that "did the prompt regress overall?" is not a useful enough question after a model upgrade. The aggregate eval can come out flat or slightly positive while individual examples in your few-shot block are silently doing harm. What you need is a per-example contribution measurement, run as part of every model migration.

The mechanic is simple to describe and unfun to operate. For each example $E_i$ in your prompt:

  • Run the eval set with $E_i$ in place. Score it.
  • Run the eval set with $E_i$ ablated (removed). Score it.
  • The delta is $E_i$'s utility on this model.

Examples whose contribution is positive stay. Examples whose contribution is zero or negative are pruned. Candidate examples that newly help on the new model (perhaps a previously borderline example that the new model finally uses well) get promoted from the candidate pool into the prompt.
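The steps above can be sketched as a leave-one-out loop. This is a minimal sketch, not a definitive implementation: `score_prompt` is a hypothetical function you would write yourself, assumed to build the prompt from the given examples, run the full eval set against the new model, and return one score. The `prune` helper and its `min_utility` threshold are likewise illustrative names.

```python
from typing import Callable, Sequence


def ablation_audit(
    examples: Sequence[str],
    score_prompt: Callable[[Sequence[str]], float],  # hypothetical eval hook
) -> list[tuple[str, float]]:
    """Return (example, utility) pairs for a leave-one-out audit.

    utility = score with every example in place
              minus score with that one example removed.
    """
    baseline = score_prompt(examples)
    report = []
    for i, example in enumerate(examples):
        ablated = [e for j, e in enumerate(examples) if j != i]
        report.append((example, baseline - score_prompt(ablated)))
    return report


def prune(
    report: Sequence[tuple[str, float]], min_utility: float = 0.0
) -> list[str]:
    """Keep only examples whose measured utility exceeds the threshold."""
    return [example for example, utility in report if utility > min_utility]
```

With eleven examples this costs twelve full eval runs per migration, which is where the "unfun to operate" part comes from.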

Three details matter in practice. First, the eval set has to be substantial enough that the per-example signal isn't noise. A 50-case eval gives you almost no statistical power per example; you want at least a few hundred cases so that a 1-2 point delta is actually distinguishable from sampling jitter. Second, you have to run with deterministic settings (temperature 0, fixed seed where the API exposes one) or run multiple trials and average — otherwise the same example will measure differently from one audit to the next. Third, examples can interact: removing one example may shift how another performs, because the model is reading them as a sequence. A first-pass audit that ablates examples one at a time will catch the dominant effects, but a second pass that ablates pairs is worth the cost on prompts where the few-shot block is doing real work.
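To check whether a per-example delta is distinguishable from sampling jitter, one option is a paired bootstrap over per-case scores, since both runs use the same eval cases. The sketch below assumes you have already collected one score per eval case (for example 1.0 for an exact match, 0.0 otherwise) for the run with the example and the run without it. The function name and its parameters are illustrative, and the bootstrap is a technique I am swapping in here, not something prescribed by any particular eval framework.

```python
import random
from typing import Sequence


def paired_bootstrap_ci(
    with_example: Sequence[float],
    without_example: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean delta, lower bound, upper bound) for the paired difference.

    Both sequences must hold per-case scores in the same eval-case order.
    """
    if len(with_example) != len(without_example) or not with_example:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [w - wo for w, wo in zip(with_example, without_example)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the audit itself is repeatable
    # Resample eval cases with replacement and record the mean delta each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval straddles zero, the audit has not shown that the example helps or hurts at this eval size, and the honest options are a larger eval set or more trials.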
