Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model
A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two points. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.
That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.
The Failure Mode the New Model Doesn't Have
Few-shot examples are usually chosen against a specific model's specific weaknesses. The senior engineer notices the model keeps emitting trailing commas in JSON output. They add an example showing the correct shape. They notice it confuses dates in DD/MM versus MM/DD when the field name is ambiguous. They add another example. By the time the prompt ships, those examples are a compressed history of the failure modes the team caught during evaluation.
When the model upgrades, those failure modes shift. The new generation has different formatting priors, different instruction-following behavior, and different blind spots. The trailing-comma example that fixed a real bug last quarter is now correcting a failure mode the new model doesn't have — and the model's exposure to that example is not free. It costs context tokens, it introduces a stylistic bias the model doesn't need, and worst of all, it can over-correct: a model that already gets JSON right can be nudged into eccentric formatting choices because the demonstrations imply the task is about formatting rather than the actual semantics.
There is a documented version of this called few-shot collapse: as you add examples, performance climbs to a peak and then falls. On some models the fall is steep — Gemma 7B has been measured dropping from 77.9% accuracy to 39.9% as the example count grows past its sweet spot. The mechanisms are well-understood: longer context degrades attention to the actual task, examples stuck in the middle of a long prompt get lost, and demonstrations that contradict the model's pre-training priors create patterns the model over-indexes on. None of those mechanisms care that your examples were good last quarter.
OpenAI has been telling people this directly. The official guidance for newer reasoning models is that "clear instructions and well-defined constraints often work better than adding examples," and that few-shot prompts can reduce performance when the task requires heavy reasoning. The Decoder has summarized OpenAI's migration advice as: treat the new model "as a new model family to tune for, not a drop-in replacement," and "begin migration with a fresh baseline instead of carrying over every instruction from an older prompt stack." Few-shot examples are the heaviest part of that older prompt stack. They are also the part teams are least willing to delete.
The Per-Example Utility Audit
The hard fact is that "did the prompt regress overall?" is not a useful enough question after a model upgrade. The aggregate eval can come out flat or slightly positive while individual examples in your few-shot block are silently doing harm. What you need is a per-example contribution measurement, run as part of every model migration.
The mechanic is simple to describe and unfun to operate. For each example in your prompt:
- Run the eval set with the example in place. Score it.
- Run the eval set with the example ablated (removed). Score it.
- The delta between the two scores is that example's utility on this model.
Examples whose contribution is positive stay. Examples whose contribution is zero or negative are pruned. Examples that newly help on the new model — perhaps a previously borderline example that the new model finally uses well — get promoted out of the candidate pool.
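A minimal sketch of that loop, assuming each example is a dict with an `id` field and that `build_prompt` and `run_eval` are stand-ins for your own prompt assembler and eval harness:
```python
def audit_examples(examples, eval_set, model, trials=3):
    """Return each example's contribution: score with it minus score without it."""
    def score(subset):
        prompt = build_prompt(subset)                  # hypothetical: assembles the prompt + few-shot block
        runs = [run_eval(prompt, eval_set, model)      # hypothetical: returns an eval-set score
                for _ in range(trials)]                # average a few trials to damp sampling jitter
        return sum(runs) / len(runs)
    baseline = score(examples)
    report = []
    for i, example in enumerate(examples):
        ablated = examples[:i] + examples[i + 1:]      # remove exactly one example
        delta = baseline - score(ablated)              # positive delta: the example is helping on this model
        report.append({"example_id": example["id"], "model": model, "delta": delta})
    return sorted(report, key=lambda row: row["delta"])
```
Run the same audit against the current model and the candidate model; the two delta columns, side by side, are the cross-version comparison this piece comes back to below.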
Three details matter in practice. First, the eval set has to be substantial enough that the per-example signal isn't noise. A 50-case eval gives you almost no statistical power per example; you want at least a few hundred cases so that a 1-2 point delta is actually distinguishable from sampling jitter. Second, you have to run with deterministic settings (temperature 0, fixed seed where the API exposes one) or run multiple trials and average — otherwise the same example will measure differently from one audit to the next. Third, examples can interact: removing one example may shift how another performs, because the model is reading them as a sequence. A first-pass audit that ablates examples one at a time will catch the dominant effects, but a second pass that ablates pairs is worth the cost on prompts where the few-shot block is doing real work.
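A sketch of that second pass, under the same assumptions: `score` is the scoring closure from the previous sketch hoisted out as a parameter, and `single_deltas[i]` is the first-pass delta for example `i`:
```python
from itertools import combinations

def interaction_check(examples, score, single_deltas, tolerance=0.01):
    """Flag example pairs whose joint effect differs from the sum of their solo deltas."""
    baseline = score(examples)
    flagged = []
    for i, j in combinations(range(len(examples)), 2):
        ablated = [ex for k, ex in enumerate(examples) if k not in (i, j)]
        joint_delta = baseline - score(ablated)
        expected = single_deltas[i] + single_deltas[j]   # what independent examples would predict
        if abs(joint_delta - expected) > tolerance:      # tolerance is an assumption; set it from your eval noise
            flagged.append((i, j, joint_delta, expected))
    return flagged
```
The pair pass costs O(n²) eval runs, which is why it is reserved for prompts where the few-shot block is doing real work.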
The output of the audit is not a single thumbs-up. It is a table: each example, each model version, the eval delta. That table tells you which examples have weakened, which have flipped sign, and which have grown stronger. Without it, you are guessing.
Provenance: Tie Every Example to Its Motivating Eval Case
Most few-shot examples in production prompts have no documented reason to exist. They were added during a specific debugging session, and the only person who remembers why is no longer on the team. When that history is missing, the audit becomes adversarial: the engineer running the migration has to argue against an example whose value they don't understand, and the safe move is always to keep it.
The discipline that fixes this is provenance. Every few-shot example in the prompt should be tied, in storage, to the motivating eval case it was added to address. A reviewer should be able to look at a prompt and see "this example exists because of test #47" rather than "Phil added it last spring." When the audit then reports that this example's contribution went negative on the new model, the reviewer can pull up test #47, run it on the new model without the example, and confirm directly: the failure mode the example was correcting is gone.
This is not theoretical bookkeeping. It is what makes pruning safe. Without provenance, deleting an example feels like tearing down a Chesterton's fence — nobody knows why it was put there, so the cautious instinct is to leave it. With provenance, the deletion is justified by a specific, measurable claim: "test #47 passes on the new model whether or not this example is present, therefore the example is no longer pulling its weight." That claim can be checked. The example can be removed without superstition.
The implementation is unglamorous. A few-shot library where each entry has fields for the eval case ID it was added to address, the model version it was first validated against, the date added, and the most recent audit result. Some teams put this in a YAML file next to the prompt. Some put it in the same vector store that holds their eval set, with explicit cross-references. The format does not matter; the discipline does.
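One possible shape for an entry in that library, shown here as a Python dataclass rather than YAML; the field names are illustrative, not a standard schema:
```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FewShotEntry:
    example_id: str          # stable ID referenced from the prompt template
    eval_case_id: str        # the motivating eval case, e.g. "test-47"
    added_for_model: str     # model version the example was first validated against
    date_added: date
    last_audit_model: str    # model version of the most recent per-example audit
    last_audit_delta: float  # eval delta from that audit; zero or negative marks a prune candidate
```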
The Monotonic-Growth Trap
Without an audit and without provenance, the predictable failure mode is monotonic growth. Examples accumulate. Each model migration adds new examples to fix new edge cases the new model exposes. None of the old examples get removed, because nobody can prove they're hurting.
After two or three model upgrades, the few-shot block is a sediment of corrections — some still load-bearing, many obsolete, a few actively harmful. The prompt has gotten longer. The latency has crept up. The cost per call has gone up. The output quality has not, and may have quietly slid. The team has no way to tell because the only thing they measure is the aggregate, and the aggregate hides the per-example damage.
The cultural side of this is that nobody on the team feels licensed to delete an example. Adding an example is a defensible act — you saw a bad output, you wrote a corrective demo, you shipped it. Removing an example feels like asking for a regression. The audit is what reverses that asymmetry: with a measured negative contribution in hand, removing the example is now the defensible act, and keeping it requires justification. Without the audit, the prompt only ever grows.
There is a simpler heuristic that approximates this discipline in shops that aren't ready for the full audit machinery: a hard cap on the few-shot block. Five examples, six examples, whatever the team can defend. When a new example needs to be added, an old example has to be cut to make room. The forced trade-off prevents accumulation, even without per-example measurement. It is a worse instrument than the audit, but it is far better than monotonic growth.
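A sketch of that cap as a pre-merge check; `MAX_EXAMPLES` and `load_prompt_examples` are assumptions about your own tooling:
```python
MAX_EXAMPLES = 5  # whatever number the team can defend

def check_fewshot_cap(prompt_path):
    examples = load_prompt_examples(prompt_path)   # hypothetical loader for your prompt format
    if len(examples) > MAX_EXAMPLES:
        raise SystemExit(
            f"{prompt_path}: {len(examples)} few-shot examples exceeds the cap of "
            f"{MAX_EXAMPLES}. Cut one before adding another."
        )
```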
Few-Shots Are Part of the Model-Prompt Pair
The deeper architectural lesson is that few-shot examples cannot be versioned independently of the model. A prompt that contains few-shot examples is not portable across model versions in any meaningful sense. The examples are tuned to a specific model's behavior, and they encode assumptions about what the model gets wrong and how it responds to demonstrations. Pinning the prompt without re-evaluating the examples on the target model is migration in name only.
The implication for the eval suite is that "the prompt passes the eval" is a model-specific claim. An eval suite that tests the prompt only against the model it was developed against will miss the moment the prompt rots on the next model. The discipline that catches this is the cross-version eval: the same prompt, the same eval set, run against both the current model and the candidate model, with per-example contribution measured separately on each side. The deltas tell you which examples to bring forward, which to drop, and which to rewrite.
The discipline that catches this earlier is to treat the prompt and the model as a versioned pair: prompt-v37 + model-2026-04 is a different artifact from prompt-v37 + model-2026-09, and the eval results on one do not transfer to the other. Production deployments should pin both, the changelog should record both, and the upgrade process should re-validate the pair as a unit. The few-shot block, in particular, is the place where the model and the prompt are most tightly coupled, and is therefore the first place that breaks when one of them changes.
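A sketch of what pinning the pair can look like; the manifest fields and version strings are illustrative, and `validated_pairs` stands in for wherever your eval results are recorded:
```python
DEPLOYED_PAIR = {
    "prompt_version": "prompt-v37",
    "model_version": "model-2026-04",
    "eval_set_version": "eval-v12",   # assumption: the eval set is versioned too
    "audit_report": "audits/prompt-v37_model-2026-04.json",
}

def assert_pair_validated(pair, validated_pairs):
    """Refuse to deploy a prompt/model combination that was never evaluated as a unit."""
    key = (pair["prompt_version"], pair["model_version"])
    if key not in validated_pairs:
        raise RuntimeError(f"{key} has no cross-version eval result; re-validate before deploying.")
```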
The teams that get this right do not stop using few-shot examples. They keep them — for the cases where examples genuinely outperform instructions, which is real and not going away. What they stop doing is treating the few-shot block as a permanent asset. They treat it the way they treat any other part of the system that depends on the model's behavior: instrumented, audited, pruned, and re-validated every time the underlying model moves. The examples that survive that process are the examples that are actually earning their tokens. Everything else is rot, and rot compounds.
Sources
- https://shuntaro-okuma.medium.com/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-d3c97ff9eb01
- https://arxiv.org/html/2507.05573v1
- https://cookbook.openai.com/examples/gpt-5/prompt-optimization-cookbook
- https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide
- https://developers.openai.com/api/docs/guides/prompt-guidance
- https://the-decoder.com/openai-says-old-prompts-are-holding-gpt-5-5-back-and-developers-need-a-fresh-baseline/
- https://arxiv.org/html/2601.22025v1
- https://agenta.ai/blog/prompt-drift
- https://www.statsig.com/perspectives/slug-prompt-regression-testing
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/multishot-prompting
