
5 posts tagged with "prompting"


Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two points. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.
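
A minimal sketch of the guardrail that would have caught it, assuming a generic completion API (`call_model`, `build_prompt`, and the model names below are all hypothetical placeholders): score the example set against a zero-shot baseline on every model version, not just the one the examples were tuned on.

```python
# Hypothetical sketch: re-score few-shot examples on every model migration.
FEW_SHOT_EXAMPLES = [...]  # your hand-tuned input/output pairs
EVAL_SET = [...]           # cases shaped like {"input": ..., "expected": ...}

def build_prompt(examples, query):
    """Prepend the example pairs to the query with a simple template."""
    shots = "".join(f"Input: {ex['input']}\nOutput: {ex['output']}\n\n" for ex in examples)
    return f"{shots}Input: {query}\nOutput:"

def call_model(model, prompt):
    """Placeholder: wire this to your provider's completion API."""
    raise NotImplementedError

def exact_match_accuracy(model, examples, eval_set):
    """Exact-match score for one (model, example-set) pair."""
    hits = sum(
        call_model(model, build_prompt(examples, case["input"])).strip() == case["expected"]
        for case in eval_set
    )
    return hits / len(eval_set)

# The regression only shows up when both versions are in the matrix.
for model in ("previous-model", "upgraded-model"):  # hypothetical version names
    few_shot = exact_match_accuracy(model, FEW_SHOT_EXAMPLES, EVAL_SET)
    zero_shot = exact_match_accuracy(model, [], EVAL_SET)
    print(f"{model}: few-shot={few_shot:.3f}  zero-shot={zero_shot:.3f}")
```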

Zero-Shot, Few-Shot, or Chain-of-Thought: A Production Decision Framework

· 10 min read
Tian Pan
Software Engineer

Ask most engineers why they're using few-shot prompting in production, and you'll hear something like: "It seemed to work better." Ask why they added chain-of-thought, and the answer is usually: "I read it helps with reasoning." These aren't wrong answers, exactly. But they're convention masquerading as engineering. The evidence on when each prompting technique actually outperforms the others is specific enough that you can make this decision systematically—and the right choice can cut token costs by 60–80% or prevent a degradation you didn't know you were causing.

Here's what the research says, and how to apply it to your stack.
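
For a flavor of what "systematic" looks like in code, here is an illustrative routing sketch (the rubric is a simplification supplied for this page, not the post's exact framework):

```python
# Illustrative sketch: pick a technique from task properties instead of
# defaulting everything to few-shot plus chain-of-thought.
def choose_technique(multi_step_reasoning: bool, strict_output_format: bool) -> str:
    if multi_step_reasoning:
        return "chain-of-thought"   # math, logic, multi-hop questions
    if strict_output_format:
        return "few-shot"           # examples pin down structure and style
    return "zero-shot"              # classification, routing, simple extraction

# A routing or extraction task usually needs neither reasoning traces nor shots.
assert choose_technique(multi_step_reasoning=False, strict_output_format=False) == "zero-shot"
```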

Zero-Shot vs. Few-Shot in Production: When Examples Help and When They Hurt

· 10 min read
Tian Pan
Software Engineer

The most common advice about few-shot prompting is: add examples, watch quality go up. That advice is wrong often enough that you shouldn't trust it without measuring. In practice, the relationship between examples and performance is non-monotonic — it peaks somewhere and then drops. Sometimes it drops a lot.

A 2025 empirical study tracked 12 LLMs across multiple tasks and found that Gemma 7B fell from 77.9% to 39.9% accuracy on a vulnerability identification task as examples were added beyond the optimal count. LLaMA-2 70B dropped from 68.6% to 21.0% on the same type of task. In code translation benchmarks, functional correctness typically peaks somewhere between 5 and 25 examples and degrades from there. This isn't a quirk of specific models — it's a pattern researchers have named "few-shot collapse," and it shows up broadly.
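
The practical counter-move is to treat shot count as a tuned hyperparameter rather than a constant. A hedged sketch, where `evaluate` stands in for whatever eval harness you already run (both names are hypothetical):

```python
# Sweep shot counts instead of assuming more examples always help.
def best_shot_count(evaluate, example_pool, counts=(0, 1, 3, 5, 10, 25)):
    """`evaluate(examples) -> accuracy` runs your eval set with those shots."""
    scores = {k: evaluate(example_pool[:k]) for k in counts if k <= len(example_pool)}
    best = max(scores, key=scores.get)  # keep the measured peak, not the largest k
    return best, scores
```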

Dynamic Few-Shot Retrieval: Why Your Static Examples Are Costing You Accuracy

· 11 min read
Tian Pan
Software Engineer

When a team hardcodes three example input-output pairs at the top of a system prompt, it feels like a reasonable engineering decision. The examples are hand-verified, the formatting is consistent, and model behavior predictably improves. Six months later, the same three examples are still there, covering 30% of incoming queries well and the rest indifferently, and nobody has run the numbers to find out which is which.

Static few-shot prompting is the most underexamined performance sink in production LLM systems. The alternative — selecting examples per request based on semantic similarity to the actual query — consistently outperforms fixed examples by double-digit quality margins across diverse task types. But the transition is neither free nor risk-free, and the failure modes on the dynamic side are less obvious than on the static side.
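
The core of the dynamic side is small. A minimal sketch, assuming a unit-normalized sentence-embedding function (`embed` is a placeholder for any embedding client; numpy is the only hard dependency):

```python
import numpy as np

def embed(texts):
    """Placeholder: return unit-normalized vectors from your embedding model."""
    raise NotImplementedError

class ExampleRetriever:
    """Select the k pool examples most similar to the incoming query."""

    def __init__(self, examples):
        self.examples = examples
        # Embed the pool once at startup; only the query is embedded per request.
        self.vectors = np.asarray(embed([ex["input"] for ex in examples]))

    def select(self, query, k=3):
        q = np.asarray(embed([query]))[0]
        scores = self.vectors @ q            # cosine similarity on unit vectors
        top = np.argsort(scores)[::-1][:k]
        return [self.examples[i] for i in top]
```

Note that the order you splice the selected examples into the prompt matters too; ordering is one of the risks the post digs into.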

This post covers what the research actually shows, how the retrieval stack works in production, the ordering and poisoning risks that most practitioners miss, and the specific cases where static examples should win.

The Token Economics of Chain-of-Thought: When Thinking Out Loud Costs More Than It's Worth

· 8 min read
Tian Pan
Software Engineer

Chain-of-thought prompting was one of the most important discoveries in applied LLM engineering. Ask a model to "think step by step," and accuracy jumps on math, logic, and multi-hop reasoning tasks. The technique became so standard that many teams apply it reflexively to every prompt in their system — classification, extraction, summarization, routing — without asking whether it's actually helping.

It usually isn't. Recent research from Wharton's Generative AI Lab shows that chain-of-thought provides no statistically significant improvement for one-third of model-task combinations, and actively hurts performance in others. Meanwhile, every CoT request inflates your token bill by 2–5x and adds seconds of latency. For production systems handling millions of requests, that's not a prompting strategy — it's an unaudited cost center.
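
The arithmetic is easy to check against your own traffic. A back-of-envelope sketch with purely illustrative volumes and prices (none of these numbers come from the post):

```python
# Illustrative CoT cost multiplier; every number here is a placeholder.
def monthly_output_cost(requests, avg_output_tokens, price_per_1k_tokens, multiplier=1.0):
    return requests * avg_output_tokens * multiplier * price_per_1k_tokens / 1000

baseline = monthly_output_cost(5_000_000, 80, 0.002)                 # -> $800/mo
with_cot = monthly_output_cost(5_000_000, 80, 0.002, multiplier=4)   # 4x, inside the 2-5x range
print(f"baseline ${baseline:,.0f}/mo vs CoT ${with_cot:,.0f}/mo")    # $800 vs $3,200
```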