Skip to main content

The Eval Harness That Ran on Yesterday's Prompt Template After Your Team Shipped a New One

· 9 min read
Tian Pan
Software Engineer

The incident timeline reads cleanly. At 9:02 your platform team pushed prompt-template@v38 to the config service. At 11:14 your dashboards showed everything green. At 16:51 someone in support flagged a spike in escalations. At 17:03 you opened the eval suite, found a regression score of 0.34, and rolled back. The post-mortem says "caught in eight hours, no customer harm beyond the 0.04% who saw it." Engineering leadership applauds the response time.

It is wrong. The regression was caught in zero hours. The eval suite running at 17:03 was the same eval suite running at 09:03. It had been pointed at v37 the entire time. The harness loaded the template from your config service at process startup, cached the rendered prompts as Python objects in module-level scope, and never reread the source. Your live traffic moved to v38 at 9am. Your eval moved at 17:03, when someone restarted the worker pool to "rerun the regression." Eight hours of customer interactions ran against a prompt that no eval had ever scored, while the eval kept grading a prompt that no production request was using.

This is the failure mode no dashboard catches because both systems report success on their own terms. The eval suite is healthy: it ran, it produced scores, it gated nothing because nothing asked it to gate. The prompt versioning system is healthy: v38 is the active version, request logs confirm it, the canary at 5% finished without alarms. The thing that broke is the link between them — the assumption that "the eval is running against the prompt in prod" — and links are not instrumented because nobody owns them.

The cache that nobody knew was a cache

Configuration services exist to remove the latency of reading a config file on every request. You centralize prompts in something like LaunchDarkly's AI configs, Braintrust's prompt library, or an internal service backed by Redis. You expose a client SDK that fetches the latest version. You document that the client is "fast" because it caches. What "caches" usually means in practice: the client fetches once at construction time and then trusts the in-memory copy until the process restarts.

That contract is fine for prompts that change at deploy cadence — every push restarts every worker, the cache invalidates implicitly, nobody notices. It fails the moment prompts ship independently of code. The whole point of moving prompts to a config service was to let prompt engineers iterate without a code deploy. Which means the cache invalidation question is now load-bearing, and the answer your SDK provides — "restart the process" — is incompatible with the workflow the platform was built to enable.

The eval harness inherits this defect by accident. It uses the same SDK, holds the same cached copy, and runs on a long-lived worker pool that exists precisely to avoid the cost of rebuilding the eval graph on every run. The longer the worker stays alive, the more confident you get that "the eval pipeline is stable," and the further the cached template drifts from production. Stability of the harness is precisely what produces the drift.

The metric that measures itself

The deeper problem is that the regression score reported by a stale eval is internally consistent. The eval grades v37 against the golden dataset. v37 was tuned against that dataset. The score is 0.91. It has been 0.91 for weeks. The score will continue to be 0.91 as long as the eval keeps grading v37, no matter what v38 or v39 does in production. There is no anomaly to alert on because the only thing changing in the world the eval can see is sampling noise.

You can confirm this with a thought experiment. If the prompt service silently returned v37 to every consumer for the rest of the year — eval, prod, canary, everyone — your dashboards would not flicker. Your eval scores would stay flat. Your prompt versioning UI would show v38 as "active." The metric you trust to catch regressions has no opinion about whether the system it is grading is the system you are running. It cannot have an opinion, because nothing in its input forces it to notice.

This is the structural property the eval/prod gap rests on: offline evals validate a fixed artifact against a fixed dataset. They are designed not to vary. When the artifact under test silently decouples from the artifact in production, the design that makes offline evals reproducible is the same design that makes them blind to the decoupling.

What "graded the wrong system" actually costs

The eight-hour window matters less than the conclusions drawn during it. Product asks "did the new prompt help?" and the eval says "no change." Engineering asks "should we ship v39?" and the eval says "v38 is fine, keep iterating." A prompt engineer looks at the v38-vs-v37 comparison in the eval dashboard, sees no meaningful delta, and concludes the change was a wash — which feels like permission to ship the next change on top, because the last one was neutral.

By the time someone in support surfaces real-world behavior, the team has stacked three more prompt changes on a base they thought was neutral and was actually a regression. The rollback is not "revert v38." It is "figure out which of v38, v39, v40, and v41 was the one that broke things, given that none of them were ever graded against the dataset everyone thinks they were graded against."

The recovery cost is not the eight hours of customer impact. It is the entire week of prompt iteration whose evaluation evidence is now invalidated, and the engineering trust in the eval scoreboard, which does not come back quickly once people have seen it report green on a broken release. The cheap framing — "our eval lagged by eight hours" — hides the expensive reality, which is that every prompt change shipped during that window has to be re-evaluated by hand, and the team will second-guess the eval for months.

Fixing the cache is the obvious move and the wrong place to start. You can make the SDK poll for changes every thirty seconds and you still have not answered the question "is the eval grading what production is running right now?" You have only reduced the window in which the answer is no.

The fix that holds up is to make the eval grade what production ran, not what the config service currently says. Concretely: every prompt render in production emits the resolved template hash alongside the response. The eval harness, when it picks up a sample of production traces to score, reads the template hash from the trace and rehydrates that exact template — not the one in its cache, not the one currently active. The eval becomes a function of "the prompt this request actually saw," which makes drift impossible to express. If the harness cannot find the template it needs, it fails loudly instead of silently substituting yesterday's copy.

The same logic applies to offline regression runs. The CI gate on a prompt change should not ask "did the harness score v38." It should ask "did the harness score the hash that v38 resolves to in the same environment that prod resolves it in." Pin the artifact at submission time. Treat the prompt version like a model version: immutable, content-addressed, joined to evaluation results by hash, not by name. The active-pointer indirection is what lets the eval and prod drift apart; remove the indirection and the drift becomes a compile error rather than a silent reporting bug.

For the harness process itself, the discipline is the discipline of any long-running consumer of mutable upstream state: either restart on every run, or treat the cache as a derived view that the upstream is responsible for invalidating. The middle ground — "I'll just hold a reference and assume it stays fresh" — is the configuration of every incident in this category. Pick a side. The eval suite that restarts cold on every invocation costs more compute and never lies about which version it graded; the eval suite that runs warm and stale costs nothing and lies whenever it matters most.

The audit that catches the next one

The reason this failure mode keeps recurring is that it has no surface in any standard runbook. There is no metric called "eval-prod prompt skew." There is no alert that fires when the harness cache age exceeds the deploy cadence. The team's mental model is that "the eval is the eval" and "the prompt is the prompt," and the integration between them is treated as a wire, not a system.

A short audit makes the wire visible. Pick any production trace from the last hour. Extract the prompt template hash. Look up the same trace in the eval scoreboard for the same time window. Confirm that the hash the eval used matches the hash production used. If your system cannot answer this question in under five minutes, the eval/prod link is implicit, which means it is decoupled in a way nobody has noticed yet. The next stale-eval incident is already in flight; you just have not seen the support ticket for it.

The teams that survive this category do not have smarter evals. They have eval pipelines that refuse to produce a score without proof of what they graded, and they treat that proof as part of the deliverable. The score on its own is a number. The score plus the prompt hash plus the model version plus the dataset commit is an artifact. Anything less is the dashboard from the timeline above — green for eight hours on a system that nobody in your eval pipeline had actually looked at.

References:Let's stay in touch and Follow me for more thoughts and updates