The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days
Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: the new model categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?
That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. Now you have an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?), and all three collide at once.
The traditional ML team has a version of this problem and is mostly fine with it. A churn model gets retrained, the new probabilities differ, nobody sends an apology email to last quarter's "likely to churn" customers. The decision was a score nobody acted on alone; humans were in the loop; the action was reversible. Agent systems do not get to hide behind any of that. The model approved the refund. The model classified the document. The model closed the support ticket. The action shipped. Now the new version disagrees.
What changes when decisions are durable
Most LLM eval frameworks are built for the case where you score model outputs and then decide whether to ship. Golden-set replay catches regressions against a fixed reference; shadow mode lets you score new responses against current ones; production sampling keeps an ongoing read on quality. All of that machinery assumes the unit of analysis is "did the model produce a good response." It says nothing about what to do when the response was the action.
Once the agent has durable side effects, the comparison stops being academic. A refund was issued, a ticket was routed to a human, a candidate was screened out, a transaction was flagged. The new model's output is no longer an opinion about a benchmark; it is an opinion about a decision your company already made and acted on. When you ship the upgrade, you are publishing — internally, at minimum — a continuously updating list of cases where the previous model and the current model disagree.
Three properties make this category of decision distinct from the predictions ML teams have been making for a decade:
- Visibility of the action. A score in a database is invisible. A refund denial in someone's inbox is not. The user remembers; the auditor can subpoena the email.
- Asymmetry of error cost. A 3% accuracy improvement is great in aggregate, but the affected user does not experience the average. They experience the specific case where the old model said no.
- Specificity of the decision rationale. Modern agents do not just emit a label; they emit reasoning, tool calls, and citations. That artifact is what regulators are starting to ask for, and it is what gets compared across model versions whether you wanted that comparison or not.
Three flavors of replay (only one is cheap)
When teams say "we should replay the last 90 days through the new model," they almost always mean one of three different things, and the cost ladder between them is steep.
Eval replay is the cheap version. You take a representative sample of historical inputs, run them through the new model in a sandbox, score the outputs against the old model's outputs (or against a held-out human label set), and produce a report. This is a regression test. Nobody's account state changes. You should already be doing this; if you are not, the rest of the conversation is premature.
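Here is a minimal sketch of that regression loop, assuming a trace store that kept each request's frozen input alongside the old model's output. The store, model client, and scorer are hypothetical stand-ins for whatever your stack provides:

```python
import random

def eval_replay(trace_store, new_model, scorer, sample_size=500, seed=7):
    """Pure regression report: no account state changes, no side effects."""
    population = list(trace_store.all_traces())   # hypothetical store API
    rng = random.Random(seed)                     # seeded so the report is reproducible
    traces = rng.sample(population, min(sample_size, len(population)))
    rows = []
    for trace in traces:
        # Replay the exact input the old model saw, not a re-fetched version.
        new_output = new_model.run(trace.input_snapshot)
        rows.append({
            "trace_id": trace.id,
            "old_output": trace.output,
            "new_output": new_output,
            "score": scorer(trace, new_output),   # e.g. agreement with a human label
        })
    flips = [r for r in rows if r["old_output"] != r["new_output"]]
    return {"sampled": len(rows), "flipped": len(flips), "rows": rows}
```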
Decision replay is the middle tier. You re-run the new model against historical inputs and produce a "what would the new model have done?" artifact for each historical decision. The output is a diff: cases where the new model would have approved instead of denied, classified A instead of B, escalated instead of auto-resolved. No real-world side effects fire — you are generating a counterfactual record, not acting on it. This is what compliance teams quietly want when a high-impact bug is found in the old model. It is also expensive: you need every input the old model saw, exactly as it saw it, including any retrieved context, tool outputs, and user state at the moment of decision.
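In code, decision replay is a counterfactual generator, not an executor. A sketch under the same assumption that every logged decision carries its frozen input snapshot; the record fields and the model client are illustrative:

```python
from dataclasses import dataclass

@dataclass
class CounterfactualDiff:
    decision_id: str
    old_decision: str    # e.g. "deny_refund"
    new_decision: str    # e.g. "approve_refund"
    old_rationale: str
    new_rationale: str

def decision_replay(decision_log, new_model):
    """Emit one counterfactual record per flipped decision; fire no actions."""
    diffs = []
    for record in decision_log:
        # Requires the complete frozen input: prompt, retrieved context,
        # tool outputs, and user state as they were at decision time.
        result = new_model.decide(record.input_snapshot)
        if result.decision != record.decision:
            diffs.append(CounterfactualDiff(
                decision_id=record.id,
                old_decision=record.decision,
                new_decision=result.decision,
                old_rationale=record.rationale,
                new_rationale=result.rationale,
            ))
    return diffs   # a diff artifact for compliance, not a queue of actions
```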
Action replay is the version that gets people fired if you do it wrong. You re-run the new model against historical inputs and let it take action — issue the refund, send the email, reverse the ticket closure. This is what someone proposes in the third meeting after a decision-replay report shows that 1.2% of refund denials should have been approvals. It is also where idempotency, communication, and consent collide. Send the same apology twice and you are now the company that emailed an old customer about a six-month-old refund they did not remember requesting.
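If you do graduate to action replay, idempotency comes first: every corrective action needs a stable key so a retry, a crash, or an over-eager second run cannot apply it twice. A sketch of that pattern; the ledger and payments client are hypothetical:

```python
import hashlib

def remediation_key(decision_id: str, action: str) -> str:
    """Stable idempotency key for one (decision, corrective action) pair."""
    return hashlib.sha256(f"{decision_id}:{action}".encode()).hexdigest()

def apply_remediation(ledger, payments_api, diff):
    key = remediation_key(diff.decision_id, diff.new_decision)
    if ledger.seen(key):   # already applied (or in flight): do nothing
        return "duplicate"
    ledger.record(key)
    # Pass the key downstream too, so a retry after a crash cannot
    # issue the same refund a second time.
    payments_api.issue_refund(diff.decision_id, idempotency_key=key)
    return "applied"
```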
The hierarchy is the point. Most organizations skip directly from "we should replay this" to imagining the action-replay outcome, then bounce off the political cost and do nothing. The healthier path is to make eval and decision replay continuous, so the question "what would have happened differently" stops being a special project and starts being a dashboard. Action replay then becomes a deliberate, scoped intervention rather than a vague aspiration.
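Making that continuous can be as small as a scheduled job that replays the most recent window of decisions and publishes the disagreement rate as a metric. A sketch, reusing the decision_replay function above; the log query and metrics client are hypothetical:

```python
def nightly_disagreement_job(decision_log, new_model, metrics, window_days=1):
    """Continuous decision replay: a dashboard metric, not a special project."""
    recent = list(decision_log.since(days=window_days))   # hypothetical query API
    diffs = decision_replay(recent, new_model)
    rate = len(diffs) / max(len(recent), 1)
    metrics.gauge("agent.decision_disagreement_rate", rate)
    metrics.gauge("agent.decisions_replayed", len(recent))
```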
Architectural prerequisites you need before any of this works
You cannot replay what you did not capture. The most common failure mode of the backfill conversation is realizing, three weeks in, that the inputs to the old model were assembled at request time from systems that have since changed shape, and there is no snapshot of what the model actually saw. By the time you discover this, the option is gone.
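The fix is cheap at write time and impossible after the fact: freeze the full input at the moment of decision, and make replay read only from that snapshot. A sketch of such a decision record; the schema is deliberately illustrative:

```python
import json
import time
import uuid

def record_decision(store, model_version, prompt, retrieved_docs,
                    tool_outputs, user_state, decision, rationale):
    """Persist everything the model actually saw, as it saw it.

    Replay must read this snapshot, never the live systems, which will
    have drifted by the time anyone asks for a backfill.
    """
    store.put(json.dumps({
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": model_version,
        "input_snapshot": {
            "prompt": prompt,
            "retrieved_docs": retrieved_docs,   # verbatim text, not doc IDs
            "tool_outputs": tool_outputs,
            "user_state": user_state,
        },
        "decision": decision,
        "rationale": rationale,
    }))
```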
