The System Prompt That Grew Faster Than Your Eval Suite
The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.
The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." The coverage ratio behind that claim has a denominator that has been silently expanding while the numerator stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.
This is the same pattern as code without tests, except the asymmetry between how prompts grow and how evals grow makes it worse. A prompt rule is one line — type it, push the PR, get a thumbs-up from a teammate who reads English, ship. An eval case is a structured artifact — write the input, write the expected behavior, write the grader, integrate it into CI, mark its coverage. The first takes thirty seconds, the second takes an afternoon. The two surfaces are evolving on different gradients, and the only thing keeping them in sync was the team's memory of which rules existed when the suite was last audited.
A Prompt Is a Code Surface With No Compiler
The conventional reading of a system prompt is that it is instructions: natural-language guidance the model conditions on. That framing is true and unhelpful. A more useful framing is that a system prompt is a set of addressable, individually testable rules that happen to be encoded as prose. Each rule is a contract: under conditions X, the model should produce behavior Y. Writing the rules in English doesn't make them any less of a specification; it just means the compiler is the model itself, and the compilation happens at inference time on traffic you may or may not be sampling.
When you look at a prompt that way, the questions you start asking are the questions you would ask of a service. How many rules does this prompt encode? Which of them have tests? Which of them have tests that fire — that produce a meaningful pass/fail rather than degenerating to "the model didn't crash"? Which rules conflict with each other? Which rules are dead — defended by an eval whose triggering scenario no longer appears in production traffic?
None of these questions are answerable by reading the prompt. They are answerable only by treating the prompt as input to tooling. The team that doesn't build that tooling is operating on faith that "we know what's in there" — and as soon as the original authors have rotated off the team, that faith is the only thing left.
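As a rough illustration of what "treating the prompt as input to tooling" could look like, here is a minimal sketch of a rule inventory that answers two of the questions above: which rules have tests, and which tests actually fire. Everything in it is hypothetical: the `Rule` and `EvalOutcome` types, the `exercised_rule` flag (which assumes your grader can report whether a case really hit the rule's triggering condition), and the status labels are illustrative names, not part of any particular eval framework.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One contract extracted from the prompt: under `condition`, expect `behavior`."""
    rule_id: str
    condition: str
    behavior: str
    eval_ids: list[str] = field(default_factory=list)  # eval cases claiming to target this rule


@dataclass
class EvalOutcome:
    """Result of one eval case from the latest run (hypothetical shape)."""
    eval_id: str
    passed: bool
    exercised_rule: bool  # assumption: the grader reports whether the rule's condition was hit


def audit(rules: list[Rule], outcomes: list[EvalOutcome]) -> dict[str, str]:
    """Label each rule as untested, vacuous, failing, or covered."""
    by_id = {o.eval_id: o for o in outcomes}
    report: dict[str, str] = {}
    for rule in rules:
        linked = [by_id[e] for e in rule.eval_ids if e in by_id]
        fired = [o for o in linked if o.exercised_rule]
        if not linked:
            report[rule.rule_id] = "untested"   # no eval case ran for this rule
        elif not fired:
            report[rule.rule_id] = "vacuous"    # cases ran but never hit the condition
        elif all(o.passed for o in fired):
            report[rule.rule_id] = "covered"
        else:
            report[rule.rule_id] = "failing"
    return report
```

The hard part this sketch skips is producing the list of `Rule` objects in the first place; it only shows that once the rules are addressable, the audit itself is a few lines.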
How Rules Get In Without Tests
The two-track drift between prompt growth and eval growth has a few characteristic entry paths. Recognizing them is most of the work of stopping them.
The first is the incident patch. Production produced a bad output, the postmortem identified the root cause as the prompt failing to anticipate a scenario, the fix was a new sentence in the system prompt, and the regression test for that exact scenario was added to the suite the same day. The new rule has coverage. So far, so good — except in the next sprint, the same author adds a clarifying clause to an adjacent rule because the new rule's edge cases overlap with old behavior. That clarification ships without its own test, because there was no incident to motivate one. The rule that came in with an incident is tested; the rule that came in alongside it is not.
The second is the safety bolt-on. A new regulation, a new category of misuse, or a new launch-blocker from a security review produces a preamble. The preamble is treated as policy, not as logic — it gets reviewed by the policy team, blessed, and merged. Policy text rarely arrives with eval cases because the team that owns policy doesn't own the eval suite. The rule lands in the prompt as text and nowhere else.
The third is the few-shot grow-in. Someone notices the model handles a particular ambiguity well when given an example, so they add an example. Then someone adds a second example to cover an adjacent case. Then a third. After six months the prompt has four few-shot examples that collectively imply a behavioral rule the team never wrote down explicitly. The rule exists, the rule shapes outputs, the rule is untested — and unlike the other two paths, the rule isn't even written as a rule. It's an emergent property of the examples.
The fourth is the tool description rewrite. The tool catalog changed; the description was updated to reflect new arguments or a new return shape; the prompt's tool section grew by twenty lines. The team that authored the change was the tools team, not the prompts team, and the prompts team didn't notice that the new description implies a usage pattern that contradicts a rule three pages up in the system prompt.
In every one of these paths, the rule got in because adding it was low-friction, and the eval stayed out because adding it was high-friction. The structural fix is to invert the friction.
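One way to picture inverted friction is a CI gate that refuses a prompt PR when it adds or rewrites a rule that no eval case targets. The sketch below assumes a convention the article does not prescribe: each rule sits on its own line, prefixed with a tag such as `[R-012]`. The tag format, the function names, and the `tested_rule_ids` input (the set of rule ids that have at least one eval case) are all hypothetical.

```python
import re

# Assumed convention: one rule per line, prefixed with a tag like "[R-012]".
RULE_LINE = re.compile(r"^\[(R-\d+)\]\s*(.+)$", re.MULTILINE)


def parse_rules(prompt: str) -> dict[str, str]:
    """Map rule id -> rule text for every tagged line in the prompt."""
    return {rule_id: text.strip() for rule_id, text in RULE_LINE.findall(prompt)}


def rules_needing_evals(old_prompt: str, new_prompt: str,
                        tested_rule_ids: set[str]) -> list[str]:
    """Return ids of rules that were added or reworded and have no eval case."""
    old, new = parse_rules(old_prompt), parse_rules(new_prompt)
    touched = {rule_id for rule_id, text in new.items() if old.get(rule_id) != text}
    return sorted(touched - set(tested_rule_ids))


def gate(old_prompt: str, new_prompt: str, tested_rule_ids: set[str]) -> int:
    """Exit-code style result: 0 lets the PR through, 1 blocks it."""
    missing = rules_needing_evals(old_prompt, new_prompt, tested_rule_ids)
    for rule_id in missing:
        print(f"{rule_id}: added or changed without an eval case")
    return 1 if missing else 0
```

A gate like this only sees rules that are written as tagged lines. Behavior implied by few-shot examples or tool descriptions would still slip past it, so it narrows the gap rather than closing it.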
Coverage Is a Ratio, and the Denominator Is the Hard Part
If you accept that the prompt is a set of rules and the eval suite is a set of tests, then the obvious metric is rules-with-tests divided by rules-in-prompt. The numerator is easy: group your eval cases by the rule they target and count the rules that have at least one case. The denominator is the part that breaks teams.
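The ratio itself is simple to compute once both sides exist. The sketch below assumes you already have a set of rule ids for the prompt and a mapping from each eval case to the rule it targets; `rule_coverage` and its argument names are illustrative, not an existing API. The usage example plugs in the numbers from the opening scenario: forty rules, with eval cases that target only three of them.

```python
def rule_coverage(rule_ids: set[str], eval_targets: dict[str, str]):
    """Compute rules-with-tests / rules-in-prompt.

    rule_ids:     every rule id currently in the prompt (the denominator).
    eval_targets: eval case id -> id of the rule that case targets.
    Returns (ratio, untested rule ids, rule ids targeted by evals but absent from the prompt).
    """
    if not rule_ids:
        raise ValueError("prompt has no rules to cover")
    targeted = set(eval_targets.values())
    tested = targeted & rule_ids           # numerator: rules with at least one case
    untested = sorted(rule_ids - tested)
    orphaned = sorted(targeted - rule_ids)  # evals defending rules that no longer exist
    return len(tested) / len(rule_ids), untested, orphaned


# Forty rules in the prompt; thirty eval cases spread over only three of them.
prompt_rules = {f"R-{n:03d}" for n in range(1, 41)}
cases = {f"case-{i}": f"R-{(i % 3) + 1:03d}" for i in range(30)}

ratio, untested, orphaned = rule_coverage(prompt_rules, cases)
print(f"coverage: {ratio:.1%}, untested rules: {len(untested)}")
# coverage: 7.5%, untested rules: 37
```

Note that thirty passing cases and 7.5% coverage describe the same suite; the case count says nothing about the denominator.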
