The System Prompt That Grew Faster Than Your Eval Suite
The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.
The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." The confidence interval has a denominator that has been silently expanding while the numerator stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.
