158 posts tagged with "prompt-engineering"

The Eval Harness That Ran on Yesterday's Prompt Template After Your Team Shipped a New One

· 9 min read
Tian Pan
Software Engineer

The incident timeline reads cleanly. At 9:02 your platform team pushed prompt-template@v38 to the config service. At 11:14 your dashboards showed everything green. At 16:51 someone in support flagged a spike in escalations. At 17:03 you opened the eval suite, found a regression score of 0.34, and rolled back. The post-mortem says "caught in eight hours, no customer harm beyond the 0.04% who saw it." Engineering leadership applauds the response time.

It is wrong. The regression was caught in zero hours. The eval suite running at 17:03 was the same eval suite running at 09:03. It had been pointed at v37 the entire time. The harness loaded the template from your config service at process startup, cached the rendered prompts as Python objects in module-level scope, and never reread the source. Your live traffic moved to v38 at 9am. Your eval moved at 17:03, when someone restarted the worker pool to "rerun the regression." Eight hours of customer interactions ran against a prompt that no eval had ever scored, while the eval kept grading a prompt that no production request was using.
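One way to guard against the stale cache described above is to have the harness compare the version it loaded at startup with the version currently live before every scoring run. The following is a minimal sketch, not the harness from the incident: `CachedTemplate`, `StaleTemplateError`, and the `fetch_live_version` callable (a stand-in for a config-service lookup) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


class StaleTemplateError(RuntimeError):
    """Raised when the harness would score a template production no longer serves."""


@dataclass(frozen=True)
class CachedTemplate:
    version: str  # version read at process startup, e.g. "v37"
    text: str     # rendered template held in memory


def assert_fresh(cached: CachedTemplate, fetch_live_version: Callable[[], str]) -> None:
    """Refuse to run evals if the cached version differs from the live one.

    `fetch_live_version` is a hypothetical stand-in for a config-service call.
    """
    live = fetch_live_version()
    if live != cached.version:
        raise StaleTemplateError(
            f"eval harness holds {cached.version}, production serves {live}"
        )


def run_eval(cached: CachedTemplate, fetch_live_version: Callable[[], str]) -> str:
    """Check freshness first, then score (scoring itself is elided in this sketch)."""
    assert_fresh(cached, fetch_live_version)
    return f"scored {cached.version}"
```

Failing loudly is a design choice here: a harness that silently reloads would hide the window in which nothing was being scored, while one that raises makes the gap visible on the dashboard.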

The Eval Set Your Prompt Engineers Turned Into Production Few-Shots

· 11 min read
Tian Pan
Software Engineer

The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.

The forensics were quick once someone thought to run them. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples (the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer), and over a few sprints they had copy-pasted the strongest of them as few-shot demonstrations into the production system prompt. The dashboard kept going up because the model was being graded on inputs it had seen verbatim at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.

The Fallback Model Whose System Prompt Was Tuned for Someone Else

· 10 min read
Tian Pan
Software Engineer

Your reliability dashboard says 99.95%. Your support inbox says something else. Twice a week, for ten or twenty minutes at a time, a thin sliver of users gets a version of your product that talks like a different company. The refusals read funny. A structured field that always rendered as a tidy two-column card now shows up as a paragraph with bullet points smashed inside it. Tone shifts from "calm expert" to "eager assistant." Nobody opens a ticket — they just close the tab and try again later.

Your provider went down. The failover worked. Latency stayed under SLO. The error budget did not move. And the experience your users got during that window was not the one you ship.

The mental model most teams carry into multi-provider architecture is that the system prompt is portable — a contract negotiated with the abstract idea of "a capable model," readable by anyone who speaks the LLM dialect. That model is wrong. A system prompt is a tuned artifact. It is tuned against a specific model's preferences, refusal grammar, formatting habits, and instruction-following biases. When the failover engages, you are not handing the same contract to a comparable counterparty. You are handing a contract written in your primary's idiom to a model that reads a different idiom and signs it anyway.
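If a system prompt is tuned per model, one possible consequence is that the router looks up the prompt by the model actually serving the request and refuses to fall back to a prompt tuned for a different one. A minimal sketch under that assumption; the model identifiers, `MissingTunedPromptError`, and `prompt_for` are hypothetical and not part of any provider SDK.

```python
class MissingTunedPromptError(LookupError):
    """Raised when a model is about to be called with no prompt tuned for it."""


def prompt_for(model_id: str, prompts_by_model: dict[str, str]) -> str:
    """Return the system prompt tuned for `model_id`, never a neighbor's prompt."""
    try:
        return prompts_by_model[model_id]
    except KeyError:
        raise MissingTunedPromptError(
            f"no system prompt tuned for {model_id!r}"
        ) from None


# Hypothetical registry: one tuned prompt per model the router may select.
PROMPTS_BY_MODEL = {
    "primary-model": "Prompt tuned against the primary model.",
    "fallback-model": "Prompt tuned against the fallback model.",
}
```

Whether a missing entry should block the failover or only raise an alert is a product decision; the sketch simply makes the gap impossible to miss.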

The Few-Shot Example Your Model Treated as Binding Precedent

· 10 min read
Tian Pan
Software Engineer

A user submits a question. Your model produces an answer that is confidently wrong in a very specific way: the format is perfect, the reasoning is well-structured, and a particular qualifier — one that does not apply to this question at all — appears in exactly the place a similar qualifier appeared in example three of your system prompt. Not a hallucination. Not a prompt injection. The model did precisely what the examples taught it to do, on a question those examples were never meant to cover.

This is the failure mode that few-shot prompting actively encourages and that most eval suites are structurally blind to. Your examples are not neutral demonstrations of "what good looks like." They are case law. The model selects the closest match by surface tokens and applies the precedent — including its constraints — to whatever case is in front of it.

The Legal Disclaimer That Leaked From the Answer Into the Tool Call Arguments

· 9 min read
Tian Pan
Software Engineer

Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by a contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, which turned out to be a behavior that ranges over every channel the model produces text into — including the JSON arguments of the next tool it called.

The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.
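A cheap check for this kind of leak is to scan every tool call's arguments for the compliance string before the call is dispatched. The sketch below assumes tool arguments arrive as a JSON-like dict; `strings_in` and `disclaimer_leaked` are hypothetical helper names, and the disclaimer text is the one quoted above.

```python
from typing import Any, Iterator

DISCLAIMER = "This information is not legal advice and should not be relied upon as such"


def strings_in(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from strings_in(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from strings_in(item)


def disclaimer_leaked(tool_arguments: dict) -> bool:
    """True if the compliance string appears in any tool-call argument."""
    needle = DISCLAIMER.lower()
    return any(needle in s.lower() for s in strings_in(tool_arguments))
```

A check like this only detects the verbatim string. A paraphrased disclaimer would pass, so it works as a tripwire rather than a guarantee.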

The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

· 10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.
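The skew described above is the kind of thing a share check on arm assignments could surface on day one rather than at readout. A minimal sketch, assuming assignments are logged as a list of arm names; `arm_shares`, `underpowered_arms`, and the 20% threshold are illustrative assumptions, not the team's tooling.

```python
from collections import Counter


def arm_shares(assignments: list[str]) -> dict[str, float]:
    """Fraction of traffic each arm received."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {arm: n / total for arm, n in counts.items()}


def underpowered_arms(
    assignments: list[str],
    expected_arms: list[str],
    min_share: float = 0.2,  # assumed threshold; pick one from a power analysis
) -> list[str]:
    """Arms whose traffic share falls below `min_share`, including arms never seen."""
    shares = arm_shares(assignments)
    return sorted(arm for arm in expected_arms if shares.get(arm, 0.0) < min_share)
```

For example, with 91 "thorough", 5 "concise", and 4 "conversational" assignments, `underpowered_arms` returns `["concise", "conversational"]`.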

The Prompt Engineer Who Quietly Became Your Only Eval Set Reader

· 8 min read
Tian Pan
Software Engineer

The eval set is a file. It is also, secretly, a theory of what the AI feature is for. The two are not the same thing, and the team that confuses them has built a quality gate whose calibration depends on a single human's working memory. When that human leaves, the file stays and the theory walks out the door.

This is the failure mode you don't see in the org chart. You scoped a prompt engineering role. You hired someone good. They shipped the v1 prompts, looked at the thin benchmark, and rewrote it into something rich — a taxonomy of failure modes, weights per category, a labeling rubric that disambiguates edge cases. The eval set became the contract for "is this model good enough to ship." Six quarters later you discover that the contract is unreadable by anyone except the person who wrote it.

The Prompt Version Your Team Treated as Independent of the Model Version

· 9 min read
Tian Pan
Software Engineer

Your incident timeline says "no deploys in the last 72 hours." Your prompt registry agrees: prompt v37 has been frozen for three weeks. Your eval harness ran clean on Tuesday night. But on Wednesday morning, the structured-output failure rate on one of your agents tripled, the retry budget on another doubled, and a third started cheerfully ignoring an instruction it had been honoring for a month. Nothing changed. Except something did change, and it changed in the place neither side of the org was watching: the model.

The prompt registry knows about prompt versions. The model gateway knows about model versions. Almost nobody, in practice, tracks the pair. And prompt v37 isn't a free-standing artifact — it is, whether your tooling admits it or not, a contract negotiated against one specific model. When the platform team rolls the claude-sonnet-latest alias forward by a single point release, the contract on the other side has been quietly amended, and your incident timeline reads "no deploys" because the deploy happened on someone else's infrastructure under a name that didn't move.
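Tracking the pair rather than either half could look like the sketch below: record the prompt version together with the resolved model identifier at the last clean eval run, then compare against what is deployed now. `DeployedPair`, `pair_changed`, and the model identifiers are hypothetical; how an alias gets resolved to a concrete version depends on the gateway and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployedPair:
    prompt_version: str  # e.g. "v37"
    model_version: str   # the resolved model identifier, not the alias


def pair_changed(last_evaluated: DeployedPair, current: DeployedPair) -> list[str]:
    """Name which side of the pair moved since the last clean eval run."""
    changed = []
    if last_evaluated.prompt_version != current.prompt_version:
        changed.append("prompt")
    if last_evaluated.model_version != current.model_version:
        changed.append("model")
    return changed
```

With this in place, a frozen prompt paired with a moved model reports `["model"]` instead of "no deploys".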

The Chargeback Model That Made Every Team Rewrite Their Prompts Overnight

· 10 min read
Tian Pan
Software Engineer

Finance sent a memo on a Monday. By Friday, every product team had shipped a prompt change, and on the following Tuesday the support queue grew by a third. Nobody had touched the model. Nobody had touched the product. The only thing that had changed was that the LLM bill was now flowing back to the teams that issued the calls — and the teams had responded the way any rational cost center responds to a new line item on its P&L. They cut it.

The story that gets told inside the company afterwards is a story about prompt engineering, or about the model regressing, or about a noisy week of user traffic. The truer story is that finance, through a chargeback policy, had quietly become a product manager. The cost-attribution dashboard was a product-quality lever that nobody had reviewed, nobody had instrumented for, and nobody owned. When it moved, every prompt in the company moved with it, and the trade-offs that produced the quality regression were never seen by the people whose job it was to see them.

The Prompt Hot-Reload That Orphaned Every In-Flight Agent Run

· 11 min read
Tian Pan
Software Engineer

The pager went off at 11:47pm. A customer had been ten minutes into a refund conversation when the agent suddenly stopped calling the process_refund tool it had been reasoning about for the entire session, hallucinated a confirmation number, and ended the chat. By the time we traced it back, the cause was obvious in retrospect: a teammate had pushed an updated system prompt at 11:46. The push was clean, the tests passed, and every new conversation worked perfectly. The few hundred conversations already in progress did not.

We had built our prompt registry to support what every prompt-versioning vendor in 2026 markets as a feature: hot-reload without redeploy. We had treated that capability as if it were a CDN cache flush — a global swap that takes effect everywhere at once. What it actually was, we learned that night, was a contract break. Every active conversation was an in-flight negotiation between an LLM and a set of instructions plus tool definitions it had been reasoning against. When the registry swapped the prompt under those conversations, half the negotiated context was now stale.
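One way to keep a hot-reload from breaking in-flight sessions is to pin each conversation to the prompt version that was live when it started. This is a minimal sketch of that idea with a toy in-memory registry; `PromptRegistry` and `Conversation` are hypothetical classes, not the registry described above.

```python
from typing import Optional


class PromptRegistry:
    """Toy in-memory registry; a hypothetical stand-in for a real prompt store."""

    def __init__(self) -> None:
        self._versions: dict[str, str] = {}
        self.latest: Optional[str] = None

    def publish(self, version: str, text: str) -> None:
        """Store a new version and make it the default for new conversations."""
        self._versions[version] = text
        self.latest = version

    def get(self, version: str) -> str:
        return self._versions[version]


class Conversation:
    """Pins the prompt version at start so later publishes do not affect it."""

    def __init__(self, registry: PromptRegistry) -> None:
        if registry.latest is None:
            raise ValueError("registry has no published prompt")
        self._registry = registry
        self.pinned_version = registry.latest

    def system_prompt(self) -> str:
        # Always read the pinned version, never `registry.latest`.
        return self._registry.get(self.pinned_version)
```

Pinning trades freshness for consistency: an urgent prompt fix will not reach running sessions, so a real system would likely also need a way to force-migrate or end them.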

The System Prompt That Grew Faster Than Your Eval Suite

· 11 min read
Tian Pan
Software Engineer

The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.

The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." Eval coverage is a ratio of rules tested to rules shipped, and its denominator has been silently expanding while the numerator stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.
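The gap between rules in the prompt and rules the suite exercises can be made into a number. The sketch below assumes each prompt rule has a stable identifier and each eval case lists the rule identifiers it covers under a `"rules"` key; both conventions, and the name `rule_coverage`, are assumptions for illustration.

```python
def rule_coverage(
    prompt_rules: set[str], eval_cases: list[dict]
) -> tuple[float, list[str]]:
    """Return (fraction of prompt rules with at least one eval case, untested rules)."""
    covered = {rule for case in eval_cases for rule in case.get("rules", [])}
    untested = sorted(prompt_rules - covered)
    if not prompt_rules:
        return 1.0, untested
    ratio = (len(prompt_rules) - len(untested)) / len(prompt_rules)
    return ratio, untested
```

Reporting the ratio next to the pass rate in CI would let "the evals pass" be read alongside how much of the prompt those evals actually describe.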

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
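A first-pass check for eval-to-prompt leakage is to look for eval inputs that appear verbatim in the rendered prompt. The sketch below is deliberately simple and assumes eval inputs and the prompt are available as plain strings; `normalize` and `find_leaked_cases` are hypothetical helpers, and exact-substring matching will miss paraphrased or retrieved near-duplicates.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaked_cases(eval_inputs: list[str], prompt_text: str) -> list[str]:
    """Return eval inputs that appear verbatim (after normalization) in the prompt."""
    haystack = normalize(prompt_text)
    leaked = []
    for case in eval_inputs:
        needle = normalize(case)
        # Skip empty inputs, which would otherwise match any prompt.
        if needle and needle in haystack:
            leaked.append(case)
    return leaked
```

Running a check like this against the prompt as rendered at request time, rather than the template on disk, matters when few-shot examples are retrieved dynamically.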