Skip to main content

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.

Why "Reproduce, Then Fix" Collapses

Traditional debugging is a search procedure. You have a failing input, you have a deterministic system, and you bisect: narrow the code path, narrow the data, until the defect is cornered. The procedure terminates because every step is repeatable.

Agents violate the precondition. The same prompt sampled twice may follow two different reasoning paths, call different tools, and arrive at different answers — one correct, one not. Re-running the request doesn't reproduce the failure; it draws a fresh sample from a distribution. You're not bisecting a bug, you're playing a slot machine and hoping it lands on the same wrong outcome.

This matters more than it sounds, because it inverts the economics of debugging. In a deterministic system, the expensive part is finding the failure and the cheap part is confirming the fix. In an agent system, the failure may be cheap to observe in aggregate — your dashboards show a 3% rise in bad outcomes — but a single instance is nearly impossible to summon on demand. You cannot fix what you cannot make happen again.

There is also no rollback in the sense on-call engineers mean it. Rolling back the agent deployment does not roll back the email it already sent, the refund it already issued, or the database migration it already wrote. The blast radius escaped the system boundary the moment the agent called a tool. The incident isn't "the service is down" — it's "the service did seven things, three of them were wrong, and two of those can't be undone."

The Runbook Has to Be Rewritten

A runbook for deterministic systems is a sequence of deterministic actions: check this dashboard, restart this process, revert this commit. None of those steps survive contact with an agent incident. The runbook needs three new disciplines baked in.

Capture state at failure time, or you have nothing. Because you cannot reproduce the failure later, the only debuggable artifact is whatever you recorded when it happened. That means the full prompt as actually assembled, the retrieved context, the tool call arguments and responses, the model and version string, and ideally the sampling parameters. A run-level trace — one trace per request, with a span for every planning step, tool call, and model invocation — turns a non-reproducible bug into a concrete record you can read after the fact. Without it, the post-incident investigation is archaeology with no artifacts. The flight recorder is not a nice-to-have; it is the entire substrate of agent debugging.

Escalate to "tune the prompt or the eval," not "revert the commit." The remediation verbs are different. When an agent regresses, the fix is rarely a code revert. It's adding a failing case to the eval set, tightening a tool description, constraining the agent's permissions, or adjusting a system prompt. The runbook's escalation ladder should name these explicitly, because an engineer trained on revert-the-commit will stare at a clean git log and conclude there is nothing to do.

Treat a model-provider regression as an upstream incident you cannot patch. Sometimes the failure is not in your system at all. A provider silently updates model weights, recalibrates safety filters, or changes decoding defaults behind a stable API version string. Your code is identical; your behavior is not. The runbook needs a branch for this: how to confirm it (compare current outputs against a frozen reference set), how to mitigate it (pin to an older snapshot if one exists, route to a fallback model, tighten output validation), and the acknowledgment that you may simply have to wait. This is the operational equivalent of a dependency you cannot fork.

For the actions an agent has already taken, the runbook needs a containment section organized by reversibility. Idempotent API calls can be reversed where the provider supports it. For irreversible actions — a payment, a sent message, a published document — there is no technical undo, so the runbook must specify the substitute: the customer-communication template, the compensating action, the financial reserve that absorbs the loss. "Roll back" is replaced by a taxonomy of what can and cannot be unwound.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates