
The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

· 12 min read
Tian Pan
Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?) all collide at once.

The traditional ML team has a version of this problem and is mostly fine with it. A churn model gets retrained, the new probabilities differ, nobody sends an apology email to last quarter's "likely to churn" customers. The decision was a score nobody acted on alone; humans were in the loop; the action was reversible. Agent systems do not get to hide behind any of that. The model approved the refund. The model classified the document. The model closed the support ticket. The action shipped. Now the new version disagrees.

What changes when decisions are durable

Most LLM eval frameworks are built for the case where you score model outputs and then decide whether to ship. Golden-set replay catches regressions against a fixed reference; shadow mode lets you score new responses against current ones; production sampling keeps an ongoing read on quality. All of that machinery assumes the unit of analysis is "did the model produce a good response." It says nothing about what to do when the response was the action.

Once the agent has durable side effects, the comparison stops being academic. A refund was issued, a ticket was routed to a human, a candidate was screened out, a transaction was flagged. The new model's output is no longer an opinion about a benchmark; it is an opinion about a decision your company already made and acted on. When you ship the upgrade, you are publishing — internally, at minimum — a continuously updating list of cases where the previous model and the current model disagree.

Three properties make this category of decision distinct from the predictions ML teams have been making for a decade:

  • Visibility of the action. A score in a database is invisible. A refund denial in someone's inbox is not. The user remembers; the auditor can subpoena the email.
  • Asymmetry of error cost. A 3% accuracy improvement is great in aggregate, but the affected user does not experience the average. They experience the specific case where the old model said no.
  • Specificity of the decision rationale. Modern agents do not just emit a label; they emit reasoning, tool calls, and citations. That artifact is what regulators are starting to ask for, and it is what gets compared across model versions whether you wanted that comparison or not.

Three flavors of replay (only one is cheap)

When teams say "we should replay the last 90 days through the new model," they almost always mean one of three different things, and the cost ladder between them is steep.

Eval replay is the cheap version. You take a representative sample of historical inputs, run them through the new model in a sandbox, score the outputs against the old model's outputs (or against a held-out human label set), and produce a report. This is a regression test. Nobody's account state changes. You should already be doing this; if you are not, the rest of the conversation is premature.
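A minimal sketch of that regression test, with `old_model` and `new_model` as hypothetical stand-ins for your actual model calls (the categorization logic here is invented purely for illustration):

```python
import random
from collections import Counter

# Hypothetical stand-ins for the old and new model calls.
def old_model(inp: str) -> str:
    return "approve" if "damaged" in inp else "deny"

def new_model(inp: str) -> str:
    return "approve" if ("damaged" in inp or "defective" in inp) else "deny"

def eval_replay(historical_inputs, sample_size=100, seed=42):
    """Sandbox regression test: score new vs. old on a sample of
    historical inputs. Nobody's account state changes; the output
    is just a report of agreements and flips."""
    rng = random.Random(seed)
    sample = rng.sample(historical_inputs, min(sample_size, len(historical_inputs)))
    report = Counter()
    for inp in sample:
        old, new = old_model(inp), new_model(inp)
        report["agree" if old == new else f"flip:{old}->{new}"] += 1
    return dict(report)

inputs = ["item damaged in transit", "changed my mind", "arrived defective", "too slow"]
print(eval_replay(inputs, sample_size=4))
```

The flip keys (`flip:deny->approve` and so on) are the interesting rows: they are the earliest, cheapest signal of how large the eventual backfill conversation will be.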

Decision replay is the middle tier. You re-run the new model against historical inputs and produce a "what would the new model have done" artifact for each historical decision. The output is a diff: cases where the new model would have approved instead of denied, classified A instead of B, escalated instead of auto-resolved. No real-world side effects fire — you are generating a counterfactual record, not acting on it. This is what compliance teams quietly want when a high-impact bug is found in the old model. It is also expensive: you need every input the old model saw, exactly as it saw it, including any retrieved context, tool outputs, and user state at the moment of decision.
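One way to shape that counterfactual artifact, as a sketch: the record layout and field names here are assumptions, and `rerun` stands in for whatever side-effect-free call re-executes the new model against a snapshotted input.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CounterfactualDiff:
    """A 'what would the new model have done' record. It is evidence,
    never an action -- nothing downstream should execute from it."""
    decision_id: str
    old_outcome: str
    new_outcome: str

    @property
    def disagrees(self) -> bool:
        return self.old_outcome != self.new_outcome

def decision_replay(historical_decisions, rerun):
    """historical_decisions: iterable of (decision_id, snapshotted_input,
    old_outcome) tuples. rerun: callable that runs the new model on the
    snapshotted input with all side effects suppressed."""
    diffs = [CounterfactualDiff(d_id, old, rerun(snap))
             for d_id, snap, old in historical_decisions]
    # Keep the full set if you like; the disagreements are the report.
    return [asdict(d) for d in diffs if d.disagrees]
```

Keeping the diff frozen and outcome-only is deliberate: the artifact should be safe to hand to compliance without any risk that handing it over triggers an action.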

Action replay is the version that gets people fired if you do it wrong. You re-run the new model against historical inputs and let it take action — issue the refund, send the email, reverse the ticket closure. This is what someone proposes in the third meeting after a decision-replay report shows that 1.2% of refund denials should have been approvals. It is also where idempotency, communication, and consent collide. Send the same apology twice and you are now the company that emailed an old customer about a six-month-old refund they did not remember requesting.

The hierarchy is the point. Most organizations skip directly from "we should replay this" to imagining the action-replay outcome, then bounce off the political cost and do nothing. The healthier path is to make eval and decision replay continuous, so the question "what would have happened differently" stops being a special project and starts being a dashboard. Action replay then becomes a deliberate, scoped intervention rather than a vague aspiration.

Architectural prerequisites you need before any of this works

You cannot replay what you did not capture. The most common failure mode of the backfill conversation is realizing, three weeks in, that the inputs to the old model were assembled at request time from systems that have since changed shape, and there is no snapshot of what the model actually saw. By the time you discover this, the option is gone.

Three primitives separate teams that can do principled replay from teams that can only argue about it:

Decision provenance. For every action the agent takes, store a record that ties the action ID to the model version, the prompt template version, the tool definitions in effect at that moment, the system prompt, and a hash of the retrieved context. The point is not just to reproduce the call — it is to know what specifically changed when you upgrade. If the new model disagrees because the prompt also changed, your "model upgrade replay" is conflating two interventions. Audit trail vendors and frameworks now treat this as table stakes for high-risk AI systems, but most teams retrofit it after their first incident.
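A sketch of what such a provenance record might look like, with hashing used so the record stays small while remaining comparable; the field names are assumptions, not a standard schema:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionProvenance:
    """Everything that could explain a disagreement between model versions."""
    action_id: str
    model_version: str
    prompt_template_version: str
    tool_defs_version: str
    system_prompt_sha256: str
    retrieved_context_sha256: str

def provenance_for(action_id, model_version, prompt_template_version,
                   tool_defs_version, system_prompt, retrieved_context):
    h = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return DecisionProvenance(
        action_id, model_version, prompt_template_version, tool_defs_version,
        h(system_prompt), h(json.dumps(retrieved_context, sort_keys=True)))

def changed_fields(a: DecisionProvenance, b: DecisionProvenance) -> set:
    """The point of the record: name exactly which interventions differ."""
    return {f for f in a.__dataclass_fields__ if getattr(a, f) != getattr(b, f)}
```

If `changed_fields` between the old decision's record and the replay's record returns anything besides `model_version`, your "model upgrade replay" is measuring more than the model upgrade.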

Input snapshotting. The model's view of the world at decision time is rarely just the user's message. It is the user's message plus retrieved documents plus tool call results plus session memory plus whatever else got stuffed into context. All of that must be saved verbatim, not reconstructed. Reconstruction is a trap because the underlying systems move: the document the RAG layer retrieved last quarter has been edited; the user's account state has changed; the tool's response schema has been bumped. Without a snapshot, your "replay" quietly becomes a different experiment wearing the old one's label, and you will not realize it until the diffs look suspicious.
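The mechanism is simple enough to sketch: persist the exact bytes the model saw, keyed by decision, with an integrity hash so a corrupted or tampered snapshot fails loudly instead of producing a plausible-looking diff. The in-memory dict here is a stand-in for whatever durable blob store you actually use.

```python
import hashlib
import json

_SNAPSHOTS = {}  # stand-in for durable blob storage keyed by decision_id

def snapshot_input(decision_id: str, context: dict) -> str:
    """Persist the model's exact view of the world at decision time."""
    blob = json.dumps(context, sort_keys=True)
    digest = hashlib.sha256(blob.encode()).hexdigest()
    _SNAPSHOTS[decision_id] = {"blob": blob, "sha256": digest}
    return digest

def load_snapshot(decision_id: str) -> dict:
    """Replay reads the saved bytes; it never re-queries live systems."""
    rec = _SNAPSHOTS[decision_id]
    if hashlib.sha256(rec["blob"].encode()).hexdigest() != rec["sha256"]:
        raise ValueError(f"snapshot integrity check failed for {decision_id}")
    return json.loads(rec["blob"])
```

The discipline this enforces is the important part: `load_snapshot` takes no arguments that could reach a live system, so a replay physically cannot drift into reconstruction.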

Idempotent action layer. This is the one teams underinvest in until they need it. Every side-effecting tool the agent can call must accept a stable idempotency key derived from the decision (typically decision_id plus the action type), and the underlying service must dedupe replays against that key. AWS's reliability guidance on this is a useful starting point: idempotency keys, bounded retries, deterministic key derivation across attempts. Without it, action replay is structurally unsafe — every retry is a new live action — and even decision replay becomes risky if a downstream system is wired to the agent's tool calls without a dry-run flag.

A useful tell: if you cannot answer "what exactly would happen if we re-ran decision X right now in dry-run mode" with a precise list of which side effects would be suppressed, your action layer is not ready.

The policy framework: required, optional, or forbidden

The technical capability to replay decisions does not tell you whether you should. That is a policy question, and the framework that holds up is to bucket each decision class by what regulators and your own users would expect.

Replay is required when the decision is a high-impact automated determination subject to a right to explanation or contestation. The EU AI Act's Article 86 gives affected persons the right to obtain meaningful explanations of decisions made by high-risk AI systems listed in Annex III; the CFPB has been explicit that adverse-action notice obligations under ECOA do not get an AI exemption, and that "complex algorithm" is not a permissible reason. If a previous version of your model issued such a decision, and you discover material error after upgrade, the question is not whether to revisit but how. The replay artifact becomes part of the record of remediation. Bury it and you have a different problem than the original error.

Replay is optional when the decision was reversible, low-impact, or purely informational. A summarization quality regression on archived support tickets does not require backfill; a triage decision that routed a low-priority bug to the wrong queue does not require an apology. Here the cost calculus favors fixing forward — improve the model, log the disagreement, move on. Backfilling these creates noise and trains the organization to treat every model change as a crisis.

Replay is forbidden when reissuing the action would cause a fresh harm independent of the original error. The classic case is medical or financial communications: even if the new model would have phrased something differently, sending a corrected message six months later may itself constitute a new disclosure event with its own legal weight. A subtler case is irreversible state — a closed account, a deleted record, a sent payment — where action replay would create a parallel reality that conflicts with what the user already experienced. In these classes, decision replay (counterfactual diff for internal use) may still be warranted, but action replay is off the table and the policy needs to say so explicitly.

The work product here is a matrix: rows are decision classes the agent makes, columns are required / optional / forbidden, and each cell has a named owner and a defined trigger. "When should we backfill" is not a question the on-call engineer should be answering at 2am.
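The matrix is small enough to live in code. A sketch, with illustrative rows only — the real decision classes, owners, and triggers are yours to define:

```python
from enum import Enum

class Disposition(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    FORBIDDEN = "forbidden"

# Illustrative rows; every cell has a disposition, a named owner, and a trigger.
POLICY_MATRIX = {
    "refund_determination": {
        "disposition": Disposition.REQUIRED,
        "owner": "head-of-risk",
        "trigger": "material error confirmed in a prior model version",
    },
    "ticket_triage": {
        "disposition": Disposition.OPTIONAL,
        "owner": "support-eng-lead",
        "trigger": "disagreement rate exceeds 5% on high-priority queue",
    },
    "medical_communication": {
        "disposition": Disposition.FORBIDDEN,
        "owner": "counsel",
        "trigger": None,  # action replay never fires; decision replay may still run
    },
}

def backfill_disposition(decision_class: str) -> Disposition:
    """The 2am answer: look it up, don't decide it on-call."""
    return POLICY_MATRIX[decision_class]["disposition"]
```

Checking the matrix into version control has a side benefit: adding a new decision class without a policy row becomes a reviewable diff rather than a silent gap.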

The org failure mode: nobody owns the answer

The pattern repeats across companies. The applied AI team builds the eval pipeline and notices the disagreements. The legal team is brought in after the first uncomfortable diff lands in a Slack channel. The product team weighs in on customer messaging. The compliance team asks for an audit log they did not have requirements for six months ago. By the time everyone is in the room, the upgrade has already shipped or the rollback window has closed, and the conversation is about how to communicate, not how to decide.

The fix is unglamorous: name the owner of the backfill question before you ship the next model. That owner is not the engineer running the eval; it is the person who can sign off on "we will not be replaying refund decisions from before April 1," in writing, with the concurrence of legal and product. The eval pipeline produces evidence; the owner produces the decision. Without that role, the disagreement diff sits in a dashboard until someone outside the company forces the question.

The corollary is that "ship the new model" stops being a deployment event and becomes a release process with a checklist: provenance is captured, snapshot integrity is verified, the decision-class policy matrix is current, the replay capability has been exercised in dry-run mode this quarter, and the named owner has signed off on the disposition of any disagreement classes that exceed a pre-set threshold. It will feel like overkill the first time. It will feel like the bare minimum the first time you skip it and a regulator asks for the records.
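The checklist above can be enforced as a gate rather than a wiki page. A sketch, assuming the check names map to whatever automated or sign-off evidence your release pipeline collects:

```python
def release_gate(checks: dict) -> tuple:
    """Ship only when every backfill prerequisite holds.
    Returns (ok, missing) so the failure message names the gaps."""
    required = [
        "provenance_captured",
        "snapshot_integrity_verified",
        "policy_matrix_current",
        "dry_run_exercised_this_quarter",
        "owner_signed_off_on_disagreements",
    ]
    missing = [c for c in required if not checks.get(c)]
    return (not missing, missing)
```

A gate that names its missing prerequisites matters more than it sounds: "blocked on `owner_signed_off_on_disagreements`" starts the right conversation, while a bare failure starts an argument about the gate.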

What to do this week

You do not need to solve all of this before the next upgrade. You do need to know which side of the line you are on.

Three concrete actions are within reach for most teams in a single sprint. Audit the decision provenance you currently capture against the list above and write down the gaps. Run a decision-replay dry run on a small slice of recent traffic with whatever model you intend to upgrade to next, and surface the disagreement rate by decision class. Schedule the conversation with legal and product about the policy matrix before the eval results force it onto the calendar.

The trap is treating model upgrades as primarily a quality decision. Quality is the easy part — the new model usually wins on average, and the vendor's release notes will tell you so. The hard part is that average improvement, in agent systems, leaves behind a long tail of specific decisions whose owners now have a documented better answer and an undocumented question about what to do with it. Plan for the question, or let the question plan for you.
