Eval-Prod Drift: The Agent Under Test Isn't the Agent in Production
The eval suite is green. The dashboard is green. A week later, support is drowning in the same complaint: "the assistant keeps refusing to book the meeting." You open the eval harness, replay the failing trace, and it works. Perfectly. Every time. The bug is not in your eval, and it is not in your model. The bug is that the agent your eval is measuring and the agent your customer is talking to are no longer the same system, and nobody has admitted it yet.
Eval-prod drift is the slow, unattributed divergence between what your eval harness loads into the agent and what your serving stack actually assembles at request time. Prompts, model pins, tool schemas, guardrail configs, and feature flags each flow into the agent through different deployment paths — code merges, config pushes, prompt-registry webhooks, experimentation platforms, runtime rollouts — and almost no team has a single source of truth that reconciles them. So the eval harness ends up measuring the version of the agent that exists in someone's PR branch, while production is running a union of yesterday's hotfix, last week's flag variant, and whatever the tool team pushed without telling anyone.
This is not a theoretical failure mode. It is the default state of any agent system older than three months whose config lives in more than one repository.
