The Eval Harness That Ran on Yesterday's Prompt Template After Your Team Shipped a New One
The incident timeline reads cleanly. At 09:02 your platform team pushed prompt-template@v38 to the config service. At 11:14 your dashboards showed everything green. At 16:51 someone in support flagged a spike in escalations. At 17:03 you opened the eval suite, found a regression score of 0.34, and rolled back. The post-mortem says "caught in eight hours, no customer harm beyond the 0.04% who saw it." Engineering leadership applauds the response time.
The post-mortem is wrong. The eval did not take eight hours to catch the regression. It took zero: it flagged v38 the first time it actually scored v38. Until 17:03, the eval suite was the same process that had been running at 09:03, and it had been pointed at v37 the entire time. The harness loaded the template from your config service at process startup, cached the rendered prompts as Python objects in module-level scope, and never reread the source. Your live traffic moved to v38 at 09:02. Your eval moved at 17:03, when someone restarted the worker pool to "rerun the regression." For eight hours, customer interactions ran against a prompt that no eval had ever scored, while the eval kept grading a prompt that no production request was using.
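One way to picture the alternative to a load-once, module-level cache is a cache that asks the config service which version is live before every eval run, and stamps each result with the version it actually scored. The sketch below is illustrative only: `RevalidatingTemplateCache`, `run_eval`, `toy_score`, and the in-memory stand-in for the config service are all hypothetical names, not the harness or config service API from the incident.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Template:
    version: str
    body: str


class RevalidatingTemplateCache:
    """Keeps one cached template, but checks the live version on every access.

    Both callables are assumptions: substitute whatever your config
    service client really exposes.
    """

    def __init__(
        self,
        fetch_live_version: Callable[[], str],
        fetch_template: Callable[[str], Template],
    ) -> None:
        self._fetch_live_version = fetch_live_version
        self._fetch_template = fetch_template
        self._cached: Optional[Template] = None

    def get(self) -> Template:
        live = self._fetch_live_version()
        # Reload whenever the cached version is missing or no longer live.
        if self._cached is None or self._cached.version != live:
            self._cached = self._fetch_template(live)
        return self._cached


def run_eval(
    cache: RevalidatingTemplateCache, score: Callable[[str], float]
) -> Dict[str, object]:
    """Score the currently live template and record which version was scored."""
    template = cache.get()
    return {"template_version": template.version, "score": score(template.body)}


# In-memory stand-in for the config service (hypothetical data).
live = {"version": "v37"}
bodies = {
    "v37": "Answer briefly: {question}",
    "v38": "Answer at length: {question}",
}

cache = RevalidatingTemplateCache(
    fetch_live_version=lambda: live["version"],
    fetch_template=lambda version: Template(version, bodies[version]),
)


def toy_score(body: str) -> float:
    # Placeholder metric so the example runs; not a real eval score.
    return float(len(body))


before = run_eval(cache, toy_score)
live["version"] = "v38"  # simulated deploy; the eval process is not restarted
after = run_eval(cache, toy_score)
```

Here `before` is stamped with v37 and `after` with v38, with no restart in between. The version stamp matters as much as the reload: a dashboard that shows which template version each score belongs to makes a stale eval visible even when the reload logic itself fails.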
