Zero-Shot vs. Few-Shot in Production: When Examples Help and When They Hurt
The most common advice about few-shot prompting is: add examples, watch quality go up. That advice is wrong often enough that you shouldn't trust it without measuring. In practice, the relationship between examples and performance is non-monotonic — it peaks somewhere and then drops. Sometimes it drops a lot.
A 2025 empirical study tracked 12 LLMs across multiple tasks and found that Gemma 7B fell from 77.9% to 39.9% accuracy on a vulnerability identification task as examples were added beyond the optimal count. LLaMA-2 70B dropped from 68.6% to 21.0% on the same type of task. In code translation benchmarks, functional correctness typically peaks somewhere between 5 and 25 examples and degrades from there. This isn't a quirk of specific models — it's a pattern researchers have named "few-shot collapse," and it shows up broadly.
