
Stale Few-Shot Examples and the Half-Life Your Prompt Repo Ignores

10 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that has been in production for more than nine months. Scroll past the role description, past the formatting rules, past the safety guardrails. Stop at the block titled <examples> or ## Examples or whatever your team called it the day someone copied the first three good Slack threads into a code block. Read them. There is a 60% chance at least one of them references a feature that has been renamed, a button that no longer exists, or a workflow the product manager quietly killed two quarters ago.

The decay is not visible from the eval dashboard. The eval scores are green. They have been green for months. They are green because the eval set was authored against the same product surface the few-shots reference, and the two have aged together in lockstep. The model is performing a flawless impression of last year's product, on a test set that grades it for being faithful to last year's product, while real users interact with this year's product and quietly tolerate the resulting confabulations. This is the half-life nobody puts in the LLMOps roadmap.

Few-shot examples are treated as a one-shot artifact: someone hand-picked them during the initial prompt iteration, demonstrated a 4-point lift over zero-shot, checked them in, and moved on. The rest of the prompt — the role description, the tool schemas, the output format — gets edited regularly because changes to those surfaces produce immediate, visible failures. Few-shots produce slow, invisible failures. They sit in the prompt like a fossil layer, encoding the product taxonomy of whichever week they were captured.

Why The Decay Stays Invisible

The most uncomfortable property of stale few-shots is that they fail in a shape your eval suite was built to miss. A standard eval set scores correctness on a fixed grid of input-output pairs. If the few-shot says "when the user asks about the Insights Dashboard, respond with X," and your eval case also says "user asks about the Insights Dashboard, expected response X," both artifacts are pointing at the same archaeological layer of the product. The eval passes. The prompt is right. The product, six months later, calls the thing the Analytics Workspace, and the model continues to call it the Insights Dashboard with serene confidence.

This is the eval-coupling problem nobody names explicitly. Few-shot examples and eval cases tend to be co-authored — same engineer, same week, same product snapshot. They share a single point of truth, which means they share a single failure mode. When the product moves, both artifacts go stale together, and nothing in your CI catches it because neither one is anomalous relative to the other.

The second invisibility vector is that few-shot examples teach behavior more than they teach content. A model that has seen three examples of "respond in two paragraphs with a bullet list" will respond in two paragraphs with a bullet list, even if the content of the examples is obsolete. The structural lift survives the staleness, which is why the eval numbers don't crash. The semantic accuracy degrades a little at a time, beneath the threshold of any aggregate metric.

The third invisibility vector is the survivorship bias of production logs. The users who tried the renamed feature and got the wrong response either gave up, filed a low-priority support ticket, or noted privately that the AI was "a bit off" and worked around it. None of those signals route back to the prompt repo. Meanwhile, the users on the legacy paths still get correct answers, because the prompt is correct for the world they're in. The aggregate satisfaction metric stays flat. The drift hides inside the long tail.

What Actually Rots Inside A Stale Few-Shot

Walk through the failure modes one at a time, because "stale" is too vague to act on. The interesting decay is mechanical.

Renamed features: the example references Insights Dashboard; the product now calls it Analytics Workspace. The model uses the old name in responses to new users, who Google it and find nothing. Support gets the ticket.

Removed surfaces: the example walks the user through "click the gear icon in the top-right and choose Export." The gear icon was replaced by a context menu in the row itself eight months ago. The model gives confident, wrong navigation instructions that users follow until they hit a dead end.

Refactored API shapes: the example shows a tool call returning {"user_id": ..., "subscription_tier": "pro"}. The API now returns {"user_id": ..., "plan": {"tier": "pro", "billing_cycle": "annual"}}. The runtime tool call still works because your code adapts, but the model's downstream reasoning quotes the old field shape, which leaks into responses as phantom keys that don't exist. A check for this kind of drift is sketched just after this list.

Drifted persona and tone: the example was written when the brand voice was "playful tech-bro." This year's voice is "calm and consultative." The few-shot still shows exclamation marks and "Awesome!" openers. Every response carries a faint whiff of the previous brand era.

Stale policy boundaries: the example demonstrates a refusal pattern for a content category that has since been moved from "refuse" to "answer with disclosure." The model continues to refuse on cases the policy now permits, and the policy team has no idea why.

Outdated reasoning patterns: the example shows a three-step decomposition that made sense for the older, smaller model. The current model can solve the same problem in one step, and forcing the verbose decomposition wastes tokens and adds latency. The example is teaching a worse strategy than the model would invent on its own.

Each of these is invisible to an aggregate score and obvious to anyone who reads the prompt against the current product. The work is not detection — it is the discipline of routinely re-reading.
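Of the six, the refactored-API failure is the one most worth automating. Here is a minimal sketch, assuming the JSON each few-shot quotes lives in files a script can reach and that you keep one fresh response per tool as a fixture; the file layout and helper names below are mine, not a standard. The idea: diff the key paths the example shows against the key paths the current API actually returns, and fail the prompt build on any field that no longer exists.

```python
# Minimal sketch: flag few-shot tool outputs whose keys no longer exist in the
# current API response shape. The file layout and names here are hypothetical.
import json
from pathlib import Path

def key_paths(obj, prefix=""):
    """Collect dotted key paths from nested JSON, e.g. {"plan": {"tier": "pro"}} -> {"plan", "plan.tier"}."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(obj, list):
        for item in obj:
            paths |= key_paths(item, prefix)
    return paths

def stale_fields(few_shot_response: dict, current_response: dict) -> set:
    """Keys the few-shot quotes that the current API no longer returns."""
    return key_paths(few_shot_response) - key_paths(current_response)

if __name__ == "__main__":
    # Assumed layout: prompts/examples/*.tool_output.json holds the JSON each
    # few-shot shows; fixtures/current/*.json holds a fresh response per tool.
    for example in Path("prompts/examples").glob("*.tool_output.json"):
        current = Path("fixtures/current") / example.name.replace(".tool_output", "")
        drift = stale_fields(json.loads(example.read_text()),
                             json.loads(current.read_text()))
        if drift:
            raise SystemExit(f"{example.name} references removed fields: {sorted(drift)}")
```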

The Maintenance Discipline That Actually Works

The fix is not "review the prompt more often." Every team says that, nobody does it, and the items that get reviewed are the ones someone is actively editing. Few-shots need their own maintenance discipline because they decay in a different shape from the rest of the prompt and have to be defended explicitly.

The first piece is provenance metadata, treated as a hard requirement on every example. An example without metadata is a dangling pointer — nobody knows what failure mode it was guarding, which release of the product it references, or whether it survived the last model migration. The metadata block I push teams toward includes: the date the example was authored, the product version or feature flag state it assumes, the failure mode it was added to address (with a link to the incident or eval case if one exists), the model version it was calibrated against, and the engineer who owns it. None of this needs to be in the prompt itself — keep it in a sidecar file, a docstring above the example, or a structured comment block that your prompt-build step strips before assembly. The point is that future-you can answer "why is this example here?" without doing archaeology.
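What that sidecar can look like in practice, with field names and values that are purely illustrative rather than any standard:

```python
# Illustrative provenance sidecar for one few-shot example. The prompt-build
# step reads this and strips it; none of it is emitted into the prompt itself.
from dataclasses import dataclass

@dataclass
class ExampleProvenance:
    example_id: str        # matches the example's key in the prompt template
    authored: str          # date the example was written
    last_reviewed: str     # bumped at every quarterly review
    product_version: str   # release or feature-flag state the example assumes
    failure_mode: str      # what it guards, with a link to the incident or eval case
    calibrated_model: str  # model version it was tuned against
    owner: str             # engineer accountable for it

EXPORT_WALKTHROUGH = ExampleProvenance(
    example_id="export-walkthrough",
    authored="2024-03-11",
    last_reviewed="2025-01-06",
    product_version="webapp-2024.10, analytics_workspace flag on",
    failure_mode="model invented a CSV export path; see the linked eval case",
    calibrated_model="gpt-4o-2024-08-06",
    owner="owner-handle",
)
```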

The second piece is scheduled review, decoupled from prompt edits. The default cadence I've seen work is quarterly for examples older than six months, plus a forced review on every model migration. The review is not "do the eval scores still pass" — that question is already answered and irrelevant. The review is two engineers reading every example aloud against the current product and asking: does this still describe how things work? If no, fix or delete. If yes, bump the "last reviewed" date in the metadata. The whole exercise takes an hour for a prompt with twelve examples and saves a quarter of incident-triage time downstream.

The third piece is eval decoupling. The structural fix for the green-eval problem is to author eval cases against a different point of truth from the few-shot examples. The easiest way to do this is to source eval cases from production traces — real user inputs, not curated ones — and grade them with a judge that has access to the current product documentation. The eval set still drifts, but it drifts on a different schedule from the prompt repo, which means at least one of them will catch the decay. Better: stand up a small "feature-vocabulary" eval that explicitly probes for the renamed surfaces — give the model a query and check that its response uses the current name, not the historical one. That eval is fast, cheap, and will catch the bulk of the failure modes above.
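A minimal version of that feature-vocabulary eval is just a table of current-versus-retired names plus a string check over the response. The rename pairs and the call_model stub below are stand-ins for your own product and client code:

```python
# Minimal feature-vocabulary eval: probe queries about renamed surfaces and
# assert the response uses the current name, never the historical one.
# RENAMES and call_model are placeholders for your own product and harness.

RENAMES = [
    # (query to probe with, current name, historical name)
    ("Where do I see usage trends?", "Analytics Workspace", "Insights Dashboard"),
]

def call_model(query: str) -> str:
    """Stand-in for however you invoke the production prompt."""
    raise NotImplementedError

def check_vocabulary() -> list[str]:
    failures = []
    for query, current, historical in RENAMES:
        response = call_model(query)
        if historical.lower() in response.lower():
            failures.append(f"uses retired name {historical!r} for query {query!r}")
        elif current.lower() not in response.lower():
            failures.append(f"never names {current!r} for query {query!r}")
    return failures
```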

The fourth piece is a deletion bias. The strongest team I've watched on this maintains a rule: an example must justify its presence at every quarterly review, or it gets deleted. Examples are not free. They consume tokens, they shape behavior in ways that aren't always desired, and they accumulate. A prompt repo that grows monotonically is a prompt repo whose half-life is dropping. The teams that ship reliable AI features are the teams that delete things.

A Concrete Review Protocol

For each few-shot example, in order, ask:

  • Does it reference any product surface — feature name, button, URL, tool schema, API field — that has changed since the example was authored?
  • Does its tone, length, or structure match the current brand voice and current model capability?
  • Does the failure mode it was added to address still exist, or has the model improved past it?
  • Is there a more recent production trace that demonstrates the same lesson better?
  • Would the prompt still pass its evals if this example were removed?

The last question is the most useful and least asked. If removing an example doesn't move the eval score, the example is either redundant with another example, redundant with the rest of the prompt, or guarding a failure mode you no longer have. In any of those cases, the example is taking up tokens and shaping behavior for no measurable lift, and it should be deleted. The eval-score-on-removal check is a cheap test you can run in a few minutes against any non-trivial example, and most teams never run it.
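The removal check is mechanical enough to script. A sketch, assuming your harness already has something like build_prompt(examples) and run_eval(prompt) returning a score (both names are assumptions): ablate one example at a time and report the ones the eval never misses.

```python
# Ablation sketch: drop each few-shot in turn and see whether the eval score
# moves. build_prompt and run_eval are assumed to exist in your own harness.

def removal_report(examples, build_prompt, run_eval, tolerance=0.5):
    """Return (example, score_delta) pairs whose removal shifts the eval score
    by less than `tolerance` points, i.e. candidates for deletion."""
    baseline = run_eval(build_prompt(examples))
    expendable = []
    for i, example in enumerate(examples):
        ablated = examples[:i] + examples[i + 1:]
        score = run_eval(build_prompt(ablated))
        if abs(baseline - score) < tolerance:
            expendable.append((example, baseline - score))
    return expendable
```

A non-empty report is a prompt to interrogate those examples at the next review, not a license to delete them blindly.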

For each example that survives, update the provenance metadata, bump the "last reviewed" date, and move on. The whole protocol is forty minutes for a typical prompt. The compounded value is that your prompt repo stops accumulating fossil layers, your responses stop quietly referencing extinct features, and your eval dashboard becomes a metric you can actually trust.

The Half-Life Is Real And You Can Choose It

Few-shot examples are not configuration. They are not boilerplate. They are the most product-specific, most context-bound, fastest-decaying part of the prompt, and they have been treated like configuration for two years because the field hasn't named the problem yet. The eval dashboards that should catch the decay don't, because the dashboards and the examples share a single point of truth that ages together.

Pick the cadence at which you re-read your few-shots. If you don't pick one, the cadence will be "whenever a real user complains loudly enough that someone goes spelunking in the prompt." That is the default, and it is slower than quarterly and more painful than scheduled. The teams that ship AI features the longest are not the teams with the largest prompt repos. They are the teams that delete the most examples per quarter.
