Eval-Prod Drift: The Agent Under Test Isn't the Agent in Production

11 min read
Tian Pan
Software Engineer

The eval suite is green. The dashboard is green. A week later, support is drowning in the same complaint: "the assistant keeps refusing to book the meeting." You open the eval harness, replay the failing trace, and it works. Perfectly. Every time. The bug is not in your eval, and it is not in your model. The bug is that the agent your eval is measuring and the agent your customer is talking to are no longer the same system, and nobody has admitted it yet.

Eval-prod drift is the slow, unattributed divergence between what your eval harness loads into the agent and what your serving stack actually assembles at request time. Prompts, model pins, tool schemas, guardrail configs, and feature flags each flow into the agent through different deployment paths — code merges, config pushes, prompt-registry webhooks, experimentation platforms, runtime rollouts — and almost no team has a single source of truth that reconciles them. So the eval harness ends up measuring the version of the agent that exists in someone's PR branch, while production is running a union of yesterday's hotfix, last week's flag variant, and whatever the tool team pushed without telling anyone.

This is not a theoretical failure mode. It is the default state of any agent system older than three months whose config lives in more than one repository.

The Agent Is Not One Artifact

If you ask an engineer "what is the agent?" they will usually point at a Python class, a prompt template, or an orchestration graph. The honest answer is: the agent is the runtime composition of roughly seven artifacts, and most of them ship independently.

Typical breakdown:

  • System prompt. Lives in a prompt registry or a config file. Often editable by PMs through a UI.
  • Model pin and decoding params. Lives in environment variables, a runtime config, or a feature flag.
  • Tool schemas. Generated from code in the tool-owning service, resolved by name at request time, sometimes cached.
  • Tool descriptions. Often edited separately from schemas, sometimes in the agent repo, sometimes in the tool repo.
  • Retrieval config. Index name, embedding model, chunker version, reranker weights. Each with its own deploy cadence.
  • Guardrails and post-processors. Regex filters, classifier gates, refusal policies, output validators.
  • Feature flags and experiment variants. Cohort-gated, user-gated, region-gated, live-updatable.

The eval harness loads one snapshot of these seven. Production assembles another snapshot, fresh, at every request. The probability that the two snapshots are byte-identical on any given day is low, and nobody is checking.
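
To make that concrete, here is a minimal sketch of the record the serving stack implicitly assembles on every request. The field names are illustrative, not a schema any particular stack mandates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentComposition:
    """Everything that shaped the model's behavior on one request.
    Field names are illustrative, not a prescribed schema."""
    system_prompt: str        # resolved text, post-template-substitution
    model_pin: str            # exact model ID and version, including provider
    decoding_params: dict     # temperature, top_p, max tokens, ...
    tool_manifest: dict       # schemas plus descriptions, as resolved at request time
    retrieval_config: dict    # index name, embedding model, chunker, reranker weights
    guardrail_bundle: str     # version of the filters, classifier gates, validators
    flag_variants: dict       # active feature-flag and experiment assignments
```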

Worse, the drift is asymmetric. Eval usually runs against whatever is on the main branch plus the datasets the eval team curates. Production runs against whatever survived the last four deploys across five repos plus whatever the flag system is serving to the user's cohort right now. Eval is a static build. Production is a live merge.

Five Paths That Quietly Split the Config

You do not get eval-prod drift by having one bad pipeline. You get it by having several pipelines, each of which is fine in isolation.

The hotfix path. An incident at 2 a.m.: the agent is leaking a banned phrase. Someone edits the system prompt through the prompt-management UI and hits deploy. Production recovers. The eval harness, which pins to the main branch's prompt file, never sees the edit. Three weeks later, an unrelated eval regression is debugged against a prompt that does not exist in production.

The eval-PR path. An evals engineer adds new test cases and, in the same PR, tweaks the system prompt to make them pass. CI goes green. The PR merges. But the prompt actually serving users is fetched from the prompt registry at runtime, which ignores the repo entirely. Eval improved; production did not move.

The feature-flag path. The growth team ships a variant of the agent that is more concise for free users and more thorough for paying users. The flag system resolves the variant per request. The eval harness runs with the flag off, because that is the deterministic choice. Free users are running an agent that has never been evaluated.

The A/B experiment path. A prompt refinement is ramped to 10% of traffic through an experiment framework that lives in a different service than the prompt registry. The eval harness sees the control. The 10% is measured only by the experiment's own success metric — typically a single thumbs-up rate, not the full eval rubric. Regressions in the unmeasured dimensions (refusal rate, tool-call accuracy, latency tail) accumulate invisibly.

The tool-schema path. The calendar team renames a parameter from attendee_emails to invitees and updates their service. The tool description pulled into the agent's context at request time now mentions a field that does not match the parameter name the model is still trained to produce. Eval runs against a frozen schema snapshot and never sees the mismatch. Production tool-call success rate quietly drops 8%.
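
A stripped-down illustration of that mismatch, with hypothetical names standing in for the real services: the schema production resolves at request time carries the new parameter, the model keeps emitting the old one, and an eval validating against a frozen snapshot never exercises the gap.

```python
# Hypothetical names for illustration; the real services will differ.
published_schema = {          # what the calendar service now resolves at request time
    "name": "create_meeting",
    "parameters": {"invitees": {"type": "array", "items": {"type": "string"}}},
}

model_call = {                # what the model, prompted on the old description, still produces
    "name": "create_meeting",
    "arguments": {"attendee_emails": ["dana@example.com"]},
}

# Arguments the live schema no longer recognizes -- invisible to an eval
# whose frozen snapshot still contains attendee_emails.
unknown = set(model_call["arguments"]) - set(published_schema["parameters"])
print(unknown)  # {'attendee_emails'}
```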

Each of these is a reasonable, even desirable, deployment practice. Hotfixes save incidents. Feature flags enable progressive rollout. Experimentation is how you learn. The problem is the union: five independent decision-makers push five independent changes, and the eval harness observes none of them.

The Single-Fingerprint Discipline

The first mitigation is not more evals. It is a single, cryptographic fingerprint that uniquely identifies the full composition of the agent at request time, and that shows up in every log line, every eval trace, and every dashboard.

Concretely, every inference request should carry a structured header or metadata field that lists:

  • The hash of the resolved system prompt (post-template-substitution).
  • The exact model ID and version pin, including provider.
  • The hash of the full tool manifest (schemas plus descriptions) the model saw.
  • The active feature-flag and experiment variant IDs.
  • The retrieval config hash (index, embedder, reranker).
  • The guardrail bundle version.

Concatenate and hash. That one value is the agent's identity on that request. If two requests have different fingerprints, they were served by different agents, full stop. If your eval harness does not emit a fingerprint, it is not evaluating an agent — it is evaluating a Platonic ideal that does not exist anywhere in your stack.
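
A minimal sketch of that hash, assuming the resolved artifacts are already in hand at request time. The field names, the SHA-256 choice, and the 16-character truncation are all arbitrary; the sort-keys discipline is not, because semantically identical compositions must hash identically.

```python
import hashlib
import json

def composition_fingerprint(
    resolved_prompt: str,     # post-template-substitution
    model_pin: str,           # exact model ID and version, including provider
    tool_manifest: dict,      # schemas plus descriptions the model saw
    flag_variants: dict,      # active feature-flag and experiment variant IDs
    retrieval_config: dict,   # index, embedder, reranker
    guardrail_version: str,
) -> str:
    """One value that is the agent's identity on this request."""
    parts = {
        "prompt_sha": hashlib.sha256(resolved_prompt.encode("utf-8")).hexdigest(),
        "model": model_pin,
        "tools_sha": hashlib.sha256(
            json.dumps(tool_manifest, sort_keys=True).encode("utf-8")
        ).hexdigest(),
        "flags": flag_variants,
        "retrieval": retrieval_config,
        "guardrails": guardrail_version,
    }
    blob = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]
```

Attach the returned value to every log line and every eval trace; nothing downstream ever needs to unpack the parts to compare two agents.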

Once fingerprints exist, the core drift-detection CI pass becomes trivial to describe: take the dominant fingerprint from the last 24 hours of production traffic, force the eval harness to load exactly that composition, and run your golden set. Any disagreement between the score on that production composition and the score on the main-branch composition is a drift alert. Any fingerprint appearing in production that has never been evaluated at all is a bigger alert.
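
In rough Python, with load_composition, run_golden_set, and main_branch_score standing in for whatever your harness already provides, that CI pass looks something like this:

```python
from collections import Counter

def drift_check(prod_traffic_log, load_composition, run_golden_set,
                main_branch_score, tolerance=0.02):
    """Score the golden set against the composition production served most in the
    last 24 hours, and compare with the main-branch composition's score."""
    dominant_fp, served = Counter(
        r["fingerprint"] for r in prod_traffic_log
    ).most_common(1)[0]
    prod_composition = load_composition(dominant_fp)   # resolve prompt, pin, tools, flags
    prod_score = run_golden_set(prod_composition)
    if abs(prod_score - main_branch_score) > tolerance:
        raise AssertionError(
            f"eval-prod drift: fingerprint {dominant_fp} ({served} requests) "
            f"scores {prod_score:.3f}; main-branch composition scored {main_branch_score:.3f}"
        )
```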

This is also how you catch the thing no static eval ever catches: a prod-only fingerprint. When production is running a combination of flag state and prompt version that has never flowed through the eval pipeline, the correct alert is "you are serving an unevaluated agent to N% of users," not "the eval is green."
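
The companion check is even simpler, assuming you keep a set of every fingerprint the harness has ever scored (evaluated_fps below is that set):

```python
from collections import Counter

def unevaluated_share(prod_traffic_log, evaluated_fps, alert_threshold=0.05):
    """Fraction of production traffic served by fingerprints the eval pipeline has never seen."""
    counts = Counter(r["fingerprint"] for r in prod_traffic_log)
    total = sum(counts.values())
    unevaluated = {fp: n / total for fp, n in counts.items() if fp not in evaluated_fps}
    if sum(unevaluated.values()) > alert_threshold:
        # Page someone: you are serving an unevaluated agent to this share of users.
        raise AssertionError(f"unevaluated fingerprints in prod: {unevaluated}")
    return unevaluated
```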

The Canary Suite as the Convergence Test

Once you can identify which agent is running where, you need a cheap way to test that the agents agree. The canary suite is that test.

A canary suite is 20–50 hand-picked cases with unambiguous expected behavior. It covers the easy cases the agent should never break: a simple tool call, a clear refusal, a standard retrieval lookup, a basic multi-turn clarification. It is deliberately small. It is deliberately boring. It is deliberately replayable end-to-end in under a minute.
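
Structurally, a canary case does not need to be more than this sketch; the expected field is just a string a shared grader knows how to check, and a real harness's case format will be richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanaryCase:
    """One hand-picked behavior the agent should never break."""
    case_id: str
    user_message: str
    expected: str   # what the shared grader checks for: a tool name, a refusal, a citation

CANARY_SUITE = [
    CanaryCase("book-simple", "Book a 30-minute meeting with Dana tomorrow at 10am.",
               "calls create_meeting"),
    CanaryCase("refuse-credentials", "What's the admin password for the billing system?",
               "refuses"),
    CanaryCase("lookup-refunds", "What is our refund window for annual plans?",
               "cites the refund policy document"),
]
```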

The discipline: the canary suite runs against production continuously — not via shadow traffic and not via a separate staging agent, but by issuing synthetic requests to the production endpoint and scoring the responses with the same graders the eval harness uses. It also runs against the eval-harness composition on every CI pass. The signal you care about is the diff. When the canary suite passes in eval and fails in production, eval-prod drift just got caught. When it fails in both, you have a real regression. When it passes in both, the agent is probably fine.
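
The convergence test itself is a diff, not a benchmark. Continuing the CanaryCase shape sketched above, and with run_against_prod, run_against_eval, and grade standing in for your production client, your harness entry point, and your shared grader:

```python
def convergence_diff(suite, run_against_prod, run_against_eval, grade):
    """Run the same cases through both compositions, grade with the same grader,
    and return only the cases where the two agents disagree."""
    disagreements = []
    for case in suite:
        prod_ok = grade(case, run_against_prod(case))
        eval_ok = grade(case, run_against_eval(case))
        if prod_ok != eval_ok:
            disagreements.append((case.case_id, {"prod": prod_ok, "eval": eval_ok}))
    return disagreements   # non-empty means eval-prod drift, not a model regression
```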

This is different from monitoring model quality in production, which most teams already do. Production quality monitoring tells you "users are less satisfied this week." The canary-suite-as-convergence-test tells you "the eval harness and the production server disagree on behavior the eval harness said was locked in." The former is a lagging indicator. The latter is the upstream cause.

A practical refinement: stratify the canary suite by subsystem. Some cases exercise only the prompt. Some exercise the tool schema path. Some exercise the retrieval config. When the suite fails, the failing stratum tells you which of the seven artifacts drifted, which shortens the post-mortem from hours to minutes.
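
With each case tagged by the artifact it exercises, the failing stratum falls out of the same run; stratum here is just a label on the case, for example "prompt", "tools", or "retrieval".

```python
def failing_strata(graded_results):
    """graded_results: iterable of (stratum, passed) pairs from one canary run.
    Returns the artifacts implicated by the failures, which is where the post-mortem starts."""
    return sorted({stratum for stratum, passed in graded_results if not passed})

# Example: tool-path cases failing while prompt and retrieval cases pass.
assert failing_strata([("prompt", True), ("tools", False),
                       ("retrieval", True), ("tools", False)]) == ["tools"]
```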

Config Convergence Is an Org Problem, Not a Tooling Problem

The reason eval-prod drift persists in mature orgs is not that teams cannot build fingerprinting or canary suites. It is that the humans making the deployment decisions report into different VPs.

Prompt edits often sit with the PM or the applied-AI team. Model pins usually live with platform engineering. Tool schemas belong to whichever backend team owns the capability. Feature flags are a growth or product function. Experiments are run by data science. The eval harness is typically owned by ML engineering or a dedicated evals team. Six owners, six cadences, six definitions of "ship."

No fingerprinting scheme survives contact with this org chart unless someone is accountable for the composition as a unit. Concretely, that role looks like:

  • One on-call rotation that owns the production fingerprint and is paged when an unevaluated fingerprint hits more than 5% of traffic.
  • One change-management review where any edit to any of the seven artifacts is announced with its expected impact and the canary-suite result.
  • One dashboard that displays, right next to eval scores, the fraction of production traffic on each fingerprint and the overlap with the fingerprint the eval harness scored.

Without this, the drift reappears the moment the on-call engineer ships a hotfix without updating the eval repo — which is always, because hotfixes are the whole point of being on call.

Treat Promotion as Atomic, Not Gradual

The usual rollout pattern — tweak a prompt, flip a flag, ramp slowly — is how drift is created, not how it is avoided. Every gradual rollout is, by construction, a window during which eval and production disagree. That window is fine if it is small and well-instrumented. It is the source of every eval-prod incident if it is neither.

The cleaner discipline: every change to the agent composition is a promotion. A promotion bundles all seven artifacts into a new fingerprint, runs them together through eval, runs them together through the canary suite in shadow mode, and then flips traffic atomically. Experiments and rollouts still exist, but they happen between fingerprints, not inside them. An A/B test is a comparison of two whole agents, each of which was evaluated as a unit, not a comparison of one prompt variant against a moving background of everything else.
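
As a sketch, with fingerprint_of, run_eval, run_canary_shadow, and flip_traffic as placeholders for your own pipeline steps, a promotion is a single gate over the whole bundle:

```python
def promote(candidate_composition, fingerprint_of, run_eval, run_canary_shadow, flip_traffic):
    """Evaluate and canary the bundled composition as a unit, then flip traffic to its
    fingerprint in one step rather than ramping individual artifacts."""
    fp = fingerprint_of(candidate_composition)
    if not run_eval(candidate_composition):
        raise RuntimeError(f"promotion blocked: eval failed for {fp}")
    if not run_canary_shadow(candidate_composition):
        raise RuntimeError(f"promotion blocked: canary shadow run failed for {fp}")
    flip_traffic(fp)          # atomic switch; experiments compare whole fingerprints
    return fp
```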

Most teams resist this because atomic promotion feels heavy compared to a hot-reloadable prompt. The reason to accept the weight is that the alternative — the status quo — is a system where the eval passed and users are complaining, and nobody can tell you whether that is a measurement problem or a product problem, and the investigation costs more than the promotion discipline ever would.

The Posture to Adopt Tomorrow

You do not need to rebuild your deployment pipeline this quarter. You need three cheap things first, in order.

  • Emit the fingerprint. Add a single structured field to every inference log that captures the hash of the full composition. You can defer the eval-side integration for weeks; the server-side hash is valuable on its own because the first thing you will discover is how many distinct fingerprints you are already serving.
  • Stand up the canary suite. Twenty cases, ten minutes to write, one cron to run every hour. If production disagrees with eval on any of them, page someone. The suite's whole job is to be a tripwire, not a benchmark.
  • Name the owner. Pick the engineer or team who will wake up when the fingerprint distribution looks wrong. Without that, the first two are decoration.

The headline regression you are trying to prevent is not a bad eval. It is the quiet weeks after a green eval when the agent your users actually talk to has drifted from the one you said you shipped. That gap closes only when the system can no longer plausibly deny it exists.
