The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts
A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples and performance climbed; then, at eight examples, it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that looked fine at four examples and breaks at twelve.
The intuition that more examples are strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.
The Three Failure Modes
Not all few-shot degradation looks the same.
Peak regression is the most dramatic and the most diagnostic. The model improves from 0 to 4 examples, peaks, then drops sharply. The Gemini 3 Flash case above is an extreme example — 63-point decline from peak to 8-shot. But the pattern is common across models: Qwen 3.5 dropped from 56% to 0% on a code-fixing task after receiving more examples. This happens when the distribution of your examples starts teaching the model something subtly wrong.
Steady plateau is the benign version. Gains flatten out, marginal improvement approaches zero, but performance doesn't crater. This is where most engineers live unknowingly — spending tokens on examples that stopped contributing after example four. The cost is wasteful rather than catastrophic.
Selection-induced collapse is the most insidious. Fixed examples perform well; dynamically retrieved examples (via TF-IDF or semantic similarity) cause a 58% relative performance drop on the same task with the same model. The examples themselves are comparable in quality; the selection strategy determines whether you reliably hit peak performance or randomly trigger failure cases. Production systems that retrieve examples per query are exposed to this.
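The two selection strategies are easy to put side by side. This is a minimal sketch with no external dependencies: a hand-rolled TF-IDF retriever over a tiny hypothetical example pool (the pool, labels, and `retrieve_examples` helper are all invented for illustration), contrasted with a fixed example set. The point is structural, not statistical: the fixed path sends identical demonstrations every time, while the retrieval path changes what the model sees per query.

```python
import math
from collections import Counter

# Hypothetical example pool: (input, label) pairs for a support-ticket classifier.
POOL = [
    ("My payment failed twice", "complaint"),
    ("How do I export my data?", "question"),
    ("The app crashes on login", "complaint"),
    ("Can I change my plan later?", "question"),
]

def tfidf_vectors(texts):
    """Bag-of-words TF-IDF over a tiny corpus (no external deps)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # one count per doc per word
    return [{w: c * math.log((n + 1) / (df[w] + 1)) for w, c in d.items()}
            for d in docs]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query, k=2):
    """Similarity-based selection: the strategy that can trigger collapse."""
    vecs = tfidf_vectors([inp for inp, _ in POOL] + [query])
    query_vec = vecs[-1]
    scored = sorted(zip(POOL, vecs[:-1]),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [ex for ex, _ in scored[:k]]

FIXED_EXAMPLES = POOL[:2]  # Fixed selection: same two examples for every query.

print(retrieve_examples("The app keeps crashing"))  # varies with the query
print(FIXED_EXAMPLES)                               # always the same pair
```

If you must use retrieval, the defensive move is to benchmark both paths on the same eval set before shipping, since the averages can hide the variance.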
Why More Examples Start to Hurt
The failure modes make sense once you understand what models are actually doing with your examples.
Models learn format and distribution, not mappings. This is the uncomfortable finding from Min et al. (2022) that many prompt engineers haven't fully absorbed: randomly replacing the correct labels in few-shot examples barely affects performance. The model isn't learning "when input looks like X, output Y" — it's learning the output format, the vocabulary register, the structure of valid responses. This means past a certain point, you're not teaching it new input-output mappings, you're just adding noise.
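The Min et al. ablation is simple enough to rerun on your own task. A sketch of the setup, with hypothetical sentiment examples (the `build_prompt` and `randomize_labels` helpers are illustrative names, not from the paper): keep the inputs, the format, and the label space identical, and scramble only the input-to-label mapping.

```python
import random

# Hypothetical sentiment demonstrations; labels below get deliberately randomized.
EXAMPLES = [
    ("The plot was gripping", "positive"),
    ("I want my money back", "negative"),
    ("Best purchase this year", "positive"),
    ("Arrived broken and late", "negative"),
]
LABELS = ["positive", "negative"]

def build_prompt(examples, query):
    """Standard few-shot prompt: demonstrations, then the real input."""
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

def randomize_labels(examples, rng):
    """The Min et al. (2022) ablation: keep inputs and format, scramble labels."""
    return [(x, rng.choice(LABELS)) for x, _ in examples]

rng = random.Random(0)  # seeded so the ablation is reproducible
gold = build_prompt(EXAMPLES, "A total waste of time")
ablated = build_prompt(randomize_labels(EXAMPLES, rng), "A total waste of time")
# Same format, same label space, same inputs; only the mapping differs.
# Score both variants against your model: per Min et al., the accuracy gap
# is often surprisingly small.
```

If your model scores nearly the same on both prompts, you have direct evidence that your examples are carrying format, not mappings, and stacking more of them won't teach it the task.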
The lost-in-the-middle effect eats your signal. Transformer attention is not uniform across context. Models attend well to the beginning and end of a prompt; the middle gets soft-focused. Pile enough examples into a prompt and the actual task instruction gets buried in the middle of a context that the model processes with degraded attention. The examples that should be helping become interference. Studies on long-context prompts consistently show 30%+ accuracy drops when critical information sits in the middle 60% of the context window.
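One mitigation is purely mechanical: keep the instruction out of the middle. A minimal prompt-assembly sketch, assuming the common "state it up front, restate it at the end" workaround (a widely used heuristic, not a prescription from the studies above):

```python
def assemble_prompt(instruction, examples, query):
    """Keep the task instruction out of the attention dead zone: state it
    up front, stack the examples in the middle, then restate it just before
    the query so it sits in the well-attended tail of the context."""
    demo_block = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return (
        f"{instruction}\n\n"
        f"{demo_block}\n\n"
        f"Reminder: {instruction}\n\n"
        f"Input: {query}\nOutput:"
    )

prompt = assemble_prompt(
    "Classify the ticket as 'complaint' or 'question'.",
    [("App crashes on start", "complaint"),
     ("How do I reset my password?", "question")],
    "Why was I charged twice?",
)
print(prompt)
```

The repeated instruction costs a few dozen tokens; burying the only copy of it under twenty examples can cost you the task.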
Spurious correlations compound with example count. Each example is a datapoint from which the model can infer implicit rules. With two or three examples, those rules are constrained — there aren't enough patterns to overfit. With ten or twenty examples, the model starts picking up structure you didn't intend. If your carefully curated examples happen to share a sentence structure, a vocabulary register, or a domain frequency bias, the model learns that spurious association alongside the legitimate one. It then applies the spurious rule to inputs that superficially match the pattern, regardless of whether the underlying task logic applies.
A concrete illustration: if you're classifying customer support tickets and your examples happen to all use the phrase "urgent" in complaints but never in questions, the model will start scoring "urgent" as a strong predictor of complaint class — even if the ticket is asking an urgent product question. Add more examples that accidentally reinforce that bias and you've dug a deeper hole, not a shallower one.
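Biases like the "urgent" one are auditable before they reach the model. A rough sketch (the `label_skew` helper and its thresholds are invented for illustration) that flags tokens appearing repeatedly in only one class of your curated examples:

```python
from collections import Counter, defaultdict

def label_skew(examples, min_count=2):
    """Flag tokens that occur in examples of only one class: candidate
    spurious cues the model may latch onto instead of the task logic."""
    by_token = defaultdict(Counter)
    for text, label in examples:
        for raw in set(text.lower().split()):
            tok = raw.strip(".,:;?!")
            by_token[tok][label] += 1
    return {tok: counts for tok, counts in by_token.items()
            if sum(counts.values()) >= min_count and len(counts) == 1}

examples = [
    ("Urgent: the site is down", "complaint"),
    ("Urgent billing error on my account", "complaint"),
    ("How do I invite a teammate?", "question"),
    ("Where can I download invoices?", "question"),
]
skew = label_skew(examples)
print(skew)  # "urgent" appears only in complaints: a spurious cue
```

A crude check like this won't catch sentence-structure or register biases, but it catches exactly the lexical kind described above, and it gets more valuable as your example count grows.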
Attention is quadratic, context isn't free. Going from 2 to 20 examples doesn't just multiply your token cost tenfold. Transformer self-attention scales quadratically in context length, which means the model is distributing its finite attention capacity across a quadratically growing set of pairwise token relationships. The computational pressure doesn't manifest as an error; it manifests as softened attention to any individual signal, including your task instruction.
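Back-of-envelope arithmetic makes the pressure concrete. The per-example and base token counts here are made-up placeholders; the quadratic relationship is the point:

```python
def attention_pairs(n_tokens):
    """Self-attention scores every ordered token pair: n^2 relationships."""
    return n_tokens * n_tokens

BASE = 200         # instruction + query tokens (assumed)
PER_EXAMPLE = 60   # tokens per demonstration (assumed)

for shots in (2, 8, 20):
    n = BASE + shots * PER_EXAMPLE
    print(f"{shots:>2} shots: {n:>5} tokens, {attention_pairs(n):>9,} pairs")
# 2 -> 20 shots is roughly a 4.4x token increase but a ~19x increase
# in pairwise relationships competing for the same attention budget.
```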
The Saturation Points by Task Type
Saturation isn't universal — it's task and model specific, which is why the failure mode is so hard to catch without explicit testing.
Translation tasks scale the furthest. For low-resource languages, some models continue improving through hundreds of examples. The pattern is consistent enough that low-resource translation is arguably the best justification for many-shot prompting. The task format is rigid, the mapping is deterministic, and examples add genuine signal.
Mathematical reasoning peaks earlier. MATH dataset performance typically peaks around 125 examples, then degrades. GPQA (graduate-level science reasoning) shows the same — 125 shots, then decline at 250. Abstract reasoning tasks seem to saturate faster than procedural ones: once the model has learned the answer format, additional examples don't teach it to reason better.
Classification and extraction tasks saturate fastest. Most industry benchmarks show the 2-5 example range captures 80% of achievable gains. Past 8 examples, improvements are in the noise for these task types. This is the most common production workload, which means most engineers are operating in a regime where their examples have already stopped helping.
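Because the saturation point is task- and model-specific, the only reliable way to find yours is an explicit sweep. A minimal harness sketch, assuming you supply a `score(examples, eval_set)` function that wraps your model call and grader (the toy `fake_score` below just mimics the peak-regression shape so the harness runs standalone):

```python
def find_saturation(pool, eval_set, score, shot_counts=(0, 1, 2, 4, 8, 16)):
    """Sweep shot counts, score each, and report the curve plus its peak."""
    curve = {k: score(pool[:k], eval_set)
             for k in shot_counts if k <= len(pool)}
    peak = max(curve, key=curve.get)
    return curve, peak

def fake_score(examples, eval_set):
    """Toy stand-in for a real model eval: improves to 4 shots, then degrades."""
    k = len(examples)
    return round(min(0.6 + 0.08 * k, 0.92) - max(0, k - 4) * 0.05, 2)

curve, peak = find_saturation(list(range(16)), None, fake_score)
print(curve, "-> peak at", peak, "shots")
```

Run the sweep once per task-model pair and pin the shot count at the peak; if the curve is still climbing at your largest count, you're in plateau or translation-style territory, and if it drops past the peak, you've measured your own regression point instead of discovering it in production.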
Sources
- https://dev.to/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i
- https://dev.to/shuntarookuma/i-tested-12-llms-with-few-shot-examples-the-results-were-not-what-i-expected-2de6
- https://arxiv.org/html/2404.11018v3
- https://www.morphllm.com/context-rot
- https://research.trychroma.com/context-rot
- https://news.mit.edu/2025/shortcoming-makes-llms-less-reliable-1126
- https://arxiv.org/html/2508.04063v1
- https://mem0.ai/blog/few-shot-prompting-guide
- https://www.promptingguide.ai/techniques/fewshot
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
