
Zero-Shot vs. Few-Shot in Production: When Examples Help and When They Hurt

· 10 min read
Tian Pan
Software Engineer

The most common advice about few-shot prompting is: add examples, watch quality go up. That advice is wrong often enough that you shouldn't trust it without measuring. In practice, the relationship between examples and performance is non-monotonic — it peaks somewhere and then drops. Sometimes it drops a lot.

A 2025 empirical study tracked 12 LLMs across multiple tasks and found that Gemma 7B fell from 77.9% to 39.9% accuracy on a vulnerability identification task as examples were added beyond the optimal count. LLaMA-2 70B dropped from 68.6% to 21.0% on the same type of task. In code translation benchmarks, functional correctness typically peaks somewhere between 5 and 25 examples and degrades from there. This isn't a quirk of specific models — it's a pattern researchers have named "few-shot collapse," and it shows up broadly.
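If the curve is non-monotonic, the practical move is to treat the shot count as a hyperparameter and sweep it against your own eval set. Here is a minimal sketch of that sweep; the names (`best_shot_count`, `score_fn`, `simulated_accuracy`) are illustrative, not from any library, and the scoring function is a synthetic stand-in for a real evaluation harness:

```python
# Hypothetical shot-count sweep: treat the number of few-shot examples
# as a hyperparameter and pick the count that scores best on a held-out eval.

def best_shot_count(shot_counts, score_fn):
    """Return (k, score) for the candidate shot count that maximizes score_fn."""
    results = {k: score_fn(k) for k in shot_counts}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]

def simulated_accuracy(k):
    # Synthetic stand-in for a real eval run: accuracy rises toward a peak
    # and then collapses, mimicking the non-monotonic pattern described above.
    peak = 8
    return round(0.78 - 0.002 * (k - peak) ** 2, 3)

if __name__ == "__main__":
    counts = [0, 1, 2, 4, 8, 16, 32, 64]
    k, acc = best_shot_count(counts, simulated_accuracy)
    print(f"best shot count: {k} (accuracy {acc})")
```

In a real pipeline, `score_fn` would build a prompt with `k` examples, call the model over your eval set, and return accuracy. The point is simply that `k` should come from measurement on your task, not from a rule of thumb.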