The Few-Shot Example Your Model Treated as Binding Precedent
A user submits a question. Your model produces an answer that is confidently wrong in a very specific way: the format is perfect, the reasoning is well-structured, and a particular qualifier — one that does not apply to this question at all — appears in exactly the place a similar qualifier appeared in example three of your system prompt. Not a hallucination. Not a prompt injection. The model did precisely what the examples taught it to do, on a question those examples were never meant to cover.
This is the failure mode that few-shot prompting actively encourages and that most eval suites are structurally blind to. Your examples are not neutral demonstrations of "what good looks like." They are case law. The model selects the closest match by surface tokens and applies the precedent — including its constraints — to whatever case is in front of it.
The deeper problem is that the same mechanism responsible for in-context learning at all — the model's ability to recognize a pattern in the prompt and continue it — is the mechanism that binds the wrong precedent to the wrong query. You cannot turn off precedent-following without turning off in-context learning. You can only shape what gets matched and what gets generalized.
Why models bind to examples on surface tokens
The mechanistic story is well-studied. Induction heads in the attention layers scan the context for prior occurrences of the current token and copy what followed last time. When the prompt contains a labeled example ("input: X, output: Y") and the user query shares salient tokens with X, the induction circuit increases the probability of generating something close to Y. This is the same circuit that makes few-shot prompting work; there is no separate "use this example" toggle.
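To make the copy rule concrete, here is a deliberately crude sketch. It works on whitespace tokens and exact matches, which is an assumption for illustration only: real induction heads operate on learned representations and match fuzzily. The function name `induction_copy` is hypothetical.

```python
from typing import Optional, Sequence


def induction_copy(tokens: Sequence[str]) -> Optional[str]:
    """Toy induction rule: find the most recent earlier occurrence of the
    final token and return whatever token followed it."""
    current = tokens[-1]
    # Scan backwards (skipping the final token itself) so the latest match wins.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# One in-prompt example about refunds, then a user query about warranties.
prompt = "Q: refund window ? A: 30 days . Q: warranty window ? A:".split()
prediction = induction_copy(prompt)
print(prediction)
```

In this cartoon the warranty question inherits the start of the refund answer, because the only thing the rule looks at is what followed the matching token last time. Nothing in the rule checks whether the earlier example was about the same subject.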
Recent mechanistic interpretability work shows induction heads are necessary for abstract pattern recognition in few-shot ICL: ablating them tanks performance on exactly the kinds of structured tasks practitioners use few-shot prompting for. They are not the only mechanism, but they are the most legible one and the one whose failure mode looks most like "this query reminded the model of that example, and now the answer is shaped like that example's answer."
Empirically, adding examples helps up to a point and then hurts, sometimes sharply, as the model binds more tightly to surface form. A 2025 study on "over-prompting" found that incorporating excessive domain-specific examples can degrade performance in current frontier LLMs, and that the optimal example count is model-specific and often surprisingly low. Other work shows that beyond roughly four examples, large models start overfitting to the example surface form and lose generality. The intuition that "more examples means better generalization" breaks down well before the examples stop seeming useful to the prompt author.
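Because the optimal count is model-specific, one practical response is to measure it rather than guess. The sketch below assumes you already have an `evaluate` callable that runs your own eval suite against a given list of in-prompt examples and returns a score; that callable, and the helper names `sweep_example_count` and `smallest_best_k`, are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input text, output text)


def sweep_example_count(
    examples: Sequence[Example],
    evaluate: Callable[[Sequence[Example]], float],
    max_k: Optional[int] = None,
) -> Dict[int, float]:
    """Score the prompt with the first k examples, for k = 0..limit."""
    limit = len(examples) if max_k is None else min(max_k, len(examples))
    return {k: evaluate(examples[:k]) for k in range(limit + 1)}


def smallest_best_k(scores: Dict[int, float], tolerance: float = 0.0) -> int:
    """Smallest k whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(k for k, score in scores.items() if score >= best - tolerance)
```

Preferring the smallest k within tolerance is a design choice: if two counts score about the same, the shorter prompt gives the model fewer surface forms to bind to. The sweep uses prefixes of a fixed ordering, so it says nothing about which examples to keep, only how many.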
The same literature finds that prompt formatting choices (whitespace, delimiters, separator tokens, capitalization in labels) can swing accuracy by up to 76 accuracy points on classification tasks for open-source models. The model is not reading examples the way you wrote them. It is reading examples as token sequences and matching the user query against those sequences as token sequences. The semantic intent you thought you encoded in the example is, from the model's perspective, one feature among many surface features it is also matching on.
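A rough way to see how exposed your own prompt is: render the same examples under several formats, score each rendering with your existing eval, and look at the spread. The sketch below only generates the variants and computes the spread; the specific separators, casings, and joiners are illustrative assumptions, not the set used in any particular study, and all names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, output text)

# Illustrative formatting axes; extend with whatever your prompt actually varies.
SEPARATORS = [": ", " - ", ":\n"]
LABEL_CASINGS = [str.lower, str.upper, str.title]
SHOT_JOINERS = ["\n", "\n\n"]


def render_prompt(examples: Sequence[Example], query: str,
                  sep: str, casing, joiner: str) -> str:
    """Render the examples plus the open query under one formatting choice."""
    in_label, out_label = casing("input"), casing("output")
    shots = [f"{in_label}{sep}{x}\n{out_label}{sep}{y}" for x, y in examples]
    shots.append(f"{in_label}{sep}{query}\n{out_label}{sep}")
    return joiner.join(shots)


def format_variants(examples: Sequence[Example], query: str) -> List[str]:
    """Every combination of separator, label casing, and shot joiner."""
    return [render_prompt(examples, query, s, c, j)
            for s, c, j in product(SEPARATORS, LABEL_CASINGS, SHOT_JOINERS)]


def accuracy_spread(accuracies: Sequence[float]) -> float:
    """Gap between the best and worst format on the same eval set."""
    return max(accuracies) - min(accuracies)
```

Every variant carries identical semantic content, so any spread you measure across them is attributable to surface form alone.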
The eval coverage gap that lets the regression ship
Most eval suites are built the way most documentation is built: each example illustrates a canonical case. The suite contains eval cases that match each in-prompt example closely, plus a held-out set of cases that match none of them. The slice that matters, cases that match the surface of one example while sharing the intent of a different example, does not exist in the suite, because the engineer building the suite was thinking about coverage of inputs and not coverage of precedent retrieval.
The result is a measurement loop that rewards the model for doing the wrong thing. If the eval set is dominated by close matches to in-prompt examples, then a model that aggressively pattern-matches to those examples will score well. If the regression is "the model treats the closest example as binding," your eval cannot see it, because every eval case is either close to exactly one example or close to none. The regression is statistically invisible until a user submits a query that lives in the gap between two examples, by which point the answer has already shipped.
The fix at the eval layer is to add an adversarial slice with one distinguishing property: each input must share the salient surface tokens of one example and the intent of a different example. The score on this slice is what tells you whether the model is applying precedent on substance or on surface. If you do not have this slice, you do not have an eval for few-shot precedent overreach; you have an eval for whether the model can copy examples it has been shown.
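One way to operationalize that slice is sketched below. It assumes each eval case is hand-labeled with the in-prompt example whose intent actually applies, and it uses Jaccard overlap on lowercased whitespace tokens as a stand-in for "salient surface tokens". That proxy is an assumption: it is far cruder than whatever the model matches on. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class PromptExample:
    example_id: str
    text: str


@dataclass(frozen=True)
class EvalCase:
    text: str
    intent_example_id: str  # hand-labeled: whose intent really applies
    passed: bool            # verdict from your own grader


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a crude proxy for surface similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def closest_by_surface(text: str, examples: Sequence[PromptExample]) -> str:
    """ID of the in-prompt example with the highest token overlap."""
    return max(examples, key=lambda e: jaccard(text, e.text)).example_id


def is_near_miss(case: EvalCase, examples: Sequence[PromptExample]) -> bool:
    """True when surface points at one example but intent at another."""
    return closest_by_surface(case.text, examples) != case.intent_example_id


def accuracy_by_slice(cases: Sequence[EvalCase],
                      examples: Sequence[PromptExample]) -> Dict[str, Optional[float]]:
    """Report accuracy separately for near-miss and canonical cases."""
    buckets: Dict[str, List[bool]] = {"near_miss": [], "canonical": []}
    for case in cases:
        name = "near_miss" if is_near_miss(case, examples) else "canonical"
        buckets[name].append(case.passed)
    return {name: (sum(v) / len(v) if v else None) for name, v in buckets.items()}
```

Reporting the two numbers separately is the point. A single blended accuracy lets a large canonical slice hide a failing near-miss slice, and an empty near-miss bucket (reported here as `None`) is itself a finding about the suite.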
This is also why "we passed the eval and shipped" is not, by itself, an argument that the system works. The eval certifies the model on whatever the eval measures. A blind spot in coverage is a blind spot in certification. Until the eval includes near-miss pairs where surface and intent diverge, the certification says "the model is good at pattern-matching examples" — which is a statement about the model's behavior, not about whether that behavior produces correct answers on real traffic.
Curating examples for adversarial coverage, not for clarity
- https://arxiv.org/abs/2509.13196
- https://arxiv.org/pdf/2407.07011
- https://arxiv.org/pdf/2310.11324
- https://arxiv.org/pdf/2507.22887
- https://arxiv.org/pdf/2302.11042
- https://arxiv.org/pdf/2501.15030
- https://learnprompting.org/docs/advanced/few_shot/k_nearest_neighbor_knn
