
The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

· 9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples and performance climbed, then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that looked fine at four examples and broke at twelve.

The intuition that more examples are strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.

The Three Failure Modes

Not all few-shot degradation looks the same.

Peak regression is the most dramatic and the most diagnostic. The model improves from 0 to 4 examples, peaks, then drops sharply. The Gemini 3 Flash case above is an extreme example — 63-point decline from peak to 8-shot. But the pattern is common across models: Qwen 3.5 dropped from 56% to 0% on a code-fixing task after receiving more examples. This happens when the distribution of your examples starts teaching the model something subtly wrong.

Steady plateau is the benign version. Gains flatten out, marginal improvement approaches zero, but performance doesn't crater. This is where most engineers live unknowingly — spending tokens on examples that stopped contributing after example four. The cost is wasteful rather than catastrophic.

Selection-induced collapse is the most insidious. Fixed examples perform well; dynamically retrieved examples (via TF-IDF or semantic similarity) cause a 58% relative performance drop on the same task with the same model. The example pool is identical in both cases; the selection strategy alone determines whether you reliably hit peak performance or randomly trigger failure cases. Production systems using retrieval-based example selection are exposed to this.
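To test whether your task is exposed, you need to A/B the two strategies directly. Here is a minimal, stdlib-only sketch: a toy TF-IDF retriever next to a fixed-prefix selector, so the same evaluation harness can be run under both. The `select_examples` function, the `{"input": ..., "output": ...}` example schema, and the whitespace tokenization are illustrative assumptions, not anyone's production retrieval stack.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build toy TF-IDF vectors over lowercase whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf keeps weights positive for terms seen in every doc.
        vectors.append({t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query, pool, k, strategy="fixed"):
    """Return k few-shot examples: a fixed prefix, or the k most
    TF-IDF-similar examples to the query (the risky strategy)."""
    if strategy == "fixed":
        return pool[:k]
    vecs = tfidf_vectors([ex["input"] for ex in pool] + [query])
    qvec = vecs[-1]
    ranked = sorted(zip(pool, vecs[:-1]),
                    key=lambda p: cosine(qvec, p[1]), reverse=True)
    return [ex for ex, _ in ranked[:k]]
```

Running your saturation sweep once with `strategy="fixed"` and once with `strategy="retrieved"` is what surfaces the collapse; averaging over a single strategy hides it.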

Why More Examples Start to Hurt

The failure modes make sense once you understand what models are actually doing with your examples.

Models learn format and distribution, not mappings. This is the uncomfortable finding from Min et al. (2022) that many prompt engineers haven't fully absorbed: randomly replacing the correct labels in few-shot examples barely affects performance. The model isn't learning "when input looks like X, output Y" — it's learning the output format, the vocabulary register, the structure of valid responses. This means past a certain point, you're not teaching it new input-output mappings, you're just adding noise.
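You can run the Min et al. style ablation on your own prompt: permute the labels in your few-shot examples and re-measure. If accuracy barely moves, your examples are carrying format, not mappings, and more of them won't help. A small sketch of the shuffling utility (the example schema with `"input"`/`"output"` keys is an assumption):

```python
import random

def shuffle_labels(examples, seed=0):
    """Return a copy of few-shot examples with labels randomly permuted
    across inputs. Used for the Min et al. (2022) ablation: if accuracy
    with shuffled labels matches accuracy with correct labels, the model
    is learning format and distribution, not input-output mappings."""
    rng = random.Random(seed)
    labels = [ex["output"] for ex in examples]
    rng.shuffle(labels)
    return [{"input": ex["input"], "output": lab}
            for ex, lab in zip(examples, labels)]
```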

The lost-in-the-middle effect eats your signal. Transformer attention is not uniform across context. Models attend well to the beginning and end of a prompt; the middle gets soft-focused. Pile enough examples into a prompt and the actual task instruction gets buried in the middle of a context that the model processes with degraded attention. The examples that should be helping become interference. Studies on long-context prompts consistently show 30%+ accuracy drops when critical information sits in the middle 60% of the context window.

Spurious correlations compound with example count. Each example is a datapoint from which the model can infer implicit rules. With two or three examples, those rules are constrained — there aren't enough patterns to overfit. With ten or twenty examples, the model starts picking up structure you didn't intend. If your carefully curated examples happen to share a sentence structure, a vocabulary register, or a domain frequency bias, the model learns that spurious association alongside the legitimate one. It then applies the spurious rule to inputs that superficially match the pattern, regardless of whether the underlying task logic applies.

A concrete illustration: if you're classifying customer support tickets and your examples happen to all use the phrase "urgent" in complaints but never in questions, the model will start scoring "urgent" as a strong predictor of complaint class — even if the ticket is asking an urgent product question. Add more examples that accidentally reinforce that bias and you've dug a deeper hole, not a shallower one.

Attention is quadratic, context isn't free. Extending from 2 to 20 examples doesn't just add linear token cost. Transformer self-attention scales quadratically in context length, which means the model is distributing its finite attention capacity across a quadratically growing set of pairwise relationships. The computational pressure doesn't manifest as an error; it manifests as softened attention to any individual signal, including your task instruction.

The Saturation Points by Task Type

Saturation isn't universal — it's task and model specific, which is why the failure mode is so hard to catch without explicit testing.

Translation tasks scale the furthest. For low-resource languages, some models continue improving through hundreds of examples. The pattern is consistent enough that low-resource translation is arguably the best justification for many-shot prompting. The task format is rigid, the mapping is deterministic, and examples add genuine signal.

Mathematical reasoning peaks earlier. MATH dataset performance typically peaks around 125 examples, then degrades. GPQA (graduate-level science reasoning) shows the same — 125 shots, then decline at 250. Abstract reasoning tasks seem to saturate faster than procedural ones: once the model has learned the answer format, additional examples don't teach it to reason better.

Classification and extraction tasks saturate fastest. Most industry benchmarks show the 2-5 example range captures 80% of achievable gains. Past 8 examples, improvements are in the noise for these task types. This is the most common production workload, which means most engineers are operating in a regime where their examples have already stopped helping.

Coding and code verification tasks land in the middle. CodeLlama studies show saturation around six examples. Past that, you're adding tokens that hurt context efficiency without adding accuracy.

Finding Your Saturation Point

If you're not systematically testing this, you don't know where you are on the curve. The methodology is straightforward.

Start with a baseline set of 50-100 test cases representative of your production distribution. Run the same test at 0, 1, 2, 4, and 8 examples — keeping the examples identical across runs, changing only count. Plot accuracy against example count. Look for the inflection point where marginal gain drops below 1% per additional example, or where the curve reverses.

Three things make this test fail in practice:

  • Using different examples at each shot count. You're measuring example quality variation, not saturation. The 2-shot and 4-shot runs need to use the same first two examples, with the 4-shot adding two more.
  • Testing on a sample too small to detect 1-2% accuracy changes. With 10 test cases you can't distinguish saturation from noise. You need at least 50.
  • Not testing your actual example selection strategy. If production uses retrieval-based selection, your fixed-example saturation test tells you the best-case scenario. Test with your actual retrieval method.

The decision rule at the end is simple: identify the shot count where accuracy gain dropped below 1% or reversed, then subtract one. That's your optimal shot count. If you're currently above it, you're paying token cost for degraded performance.

When to Stop Prompting and Start Fine-Tuning

The saturation curve has a natural exit. If your task has a defined input-output format, your examples are high quality, and you're still not hitting acceptable performance at 8 examples — you've hit the ceiling of what few-shot prompting can achieve for this model on this task.

The practical threshold for fine-tuning is 50-100 labeled examples for closed models (GPT-4o, Claude) and somewhat more for open-weight models. Below 50 examples, few-shot is usually better: the fine-tuning update is too noisy to beat a clean in-context demonstration, and you lose the ability to iterate rapidly.

The case for fine-tuning isn't just accuracy ceiling. It's also about token efficiency. A fine-tuned model that has internalized your task format produces the same quality output from a zero-shot prompt as a base model produces from a 10-shot prompt — at 1/10th the token cost per request. At scale, this compounds. If you're running millions of inferences, the break-even point on fine-tuning investment often arrives within weeks.
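The break-even arithmetic is worth sketching explicitly. All numbers below are illustrative placeholders, not quotes for any provider, and the function assumes the fine-tuned model matches few-shot quality at zero-shot:

```python
def fine_tune_break_even(ft_cost_usd, prompt_tokens_saved, price_per_1k_input):
    """Requests needed before a one-time fine-tuning cost is repaid by
    dropping few-shot examples from every request's prompt."""
    savings_per_request = prompt_tokens_saved / 1000 * price_per_1k_input
    return ft_cost_usd / savings_per_request

# Hypothetical: $500 fine-tune, 1,500 example tokens dropped per request,
# $0.002 per 1K input tokens -> break-even after roughly 167K requests.
```

At a few million requests per month, that break-even point arrives in days, which is why the calculus flips so quickly at scale.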

One caveat: fine-tuned models typically underperform on out-of-distribution inputs. In-context learning shows better generalization when the test distribution shifts from the training distribution. If your production inputs drift significantly over time, few-shot with dynamic example selection may stay more robust than fine-tuning, even if its peak accuracy is lower.

Practical Defaults for Engineers

Given the research, here are defaults that minimize accidental degradation:

  • Start with 2-3 examples. This range captures most of the format-learning benefit without risking spurious correlation overfitting. Get baseline metrics before adding more.
  • Increment deliberately. Go from 3 to 5, measure, then 5 to 8. Don't jump straight to "more is better."
  • Place best examples first and last. Given the lost-in-the-middle effect, the examples that define your critical edge cases should be at the start and end of your example block, not buried in the middle.
  • Use fixed examples before trying retrieval. Dynamic example selection introduces selection-induced collapse risk. Validate that your task benefits from retrieval before building the infrastructure for it.
  • Scale down for reasoning models. If you're using o1, o3, or Claude's extended thinking modes, start at zero-shot. These models benefit from space to reason, not constrained demonstrations. Few-shot prompting has been shown to degrade performance in this model class — they internalize their own chain-of-thought rather than following yours.
  • If you need more than 8 examples to reach target accuracy, investigate fine-tuning. You're likely at the ceiling of what in-context learning can do.
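The first-and-last placement rule is easy to enforce mechanically when assembling prompts. A small sketch, assuming each example carries an `"id"` key and you maintain a set of critical edge-case ids (both are hypothetical conventions for illustration):

```python
def order_examples(examples, critical_ids):
    """Split critical edge-case examples across the start and end of the
    example block, pushing routine examples into the soft-focus middle
    where lost-in-the-middle attention decay does the least damage."""
    critical = [ex for ex in examples if ex["id"] in critical_ids]
    routine = [ex for ex in examples if ex["id"] not in critical_ids]
    half = (len(critical) + 1) // 2
    return critical[:half] + routine + critical[half:]
```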

Conclusion

The few-shot saturation curve isn't a theoretical concern — it's a production failure mode that shows up in A/B tests, regression bugs after prompt updates, and unexplained quality drops after engineers tried to "improve" prompts by adding more examples. The empirical evidence is clear enough that it should change your defaults: treat example count as a hyperparameter that requires testing, not a dial you can increase freely.

The operational implication is straightforward: if you haven't run saturation tests on your production prompts, you likely have examples that are hurting you. The test takes an afternoon. The token savings and accuracy recovery often make it the highest-ROI prompt engineering work you can do.
