Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't
A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Twenty-eight seconds later, by the agent's own clock, the response lands with two seconds to spare against the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.
The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now planning against a thirty-second budget when the true remainder is twelve, and it has no way to know.
This failure mode has a name in distributed systems: deadline drift. gRPC has fought it for a decade and the fix is well-understood: propagate deadlines as wall-clock-anchored values, deduct elapsed time at every hop, refuse work whose projected duration exceeds what remains. Agent harnesses are quietly rebuilding the same plumbing under a different name, and a lot of them are getting it wrong in the same spots where early gRPC users did. The difference is that an LLM, unlike a typed RPC client, will cheerfully reason about a stale number as if it were fresh, and produce a confidently wrong plan instead of a clear error.
The retry-budget split
The trap is structural. Most agent harnesses treat a tool call failure as a low-level concern: catch the transient error, back off, retry, return the result up to the planner as if the first attempt had simply taken a little longer. This is a clean abstraction at the boundary, and it works fine for the part it was designed for. The planner is freed from caring about flaky network paths. The model never sees the noise.
But the abstraction leaks the moment you ask the planner to reason about time. The planner has been told it has thirty seconds. It is making decisions — should I take the cheap path or the deep path, should I parallelize, should I bail and apologize — based on that number. The harness, meanwhile, has been silently consuming chunks of that thirty seconds on retried attempts the planner never learned about.
Each layer is correct in isolation. The retry layer hides transient failures, as designed. The planner allocates time against the deadline it was given, as designed. Together they produce a system that is uniformly late on requests with retries and uniformly punctual on requests without — a bimodal SLA pattern that doesn't show up in any aggregate metric until a customer files a ticket.
The deeper issue is that "deadline" was never a single concept. There are at least three:
- The wall-clock deadline, anchored to when the user pressed enter.
- The current attempt's deadline, anchored to when this retry started.
- The model's belief about the deadline, anchored to whatever was put in its context.
If those three numbers ever diverge, the agent can be making correct decisions against the wrong frame of reference. The wider the divergence, the wronger the decisions, and the divergence grows monotonically with retry count.
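To make the divergence concrete, here is a minimal runnable sketch in Python, with hypothetical variable names, of how a single hidden retry pries the three numbers apart:

```python
import time

BUDGET_S = 30.0

# 1. The wall-clock deadline, anchored to when the user pressed enter.
request_received_at = time.monotonic()
wall_clock_deadline = request_received_at + BUDGET_S

time.sleep(0.5)  # stand-in for eighteen seconds of silent retries

# 2. The current attempt's deadline, anchored to when this retry started.
attempt_started_at = time.monotonic()
attempt_deadline = attempt_started_at + BUDGET_S  # clock quietly reset

# 3. The model's belief, anchored to whatever was put in its context.
models_believed_budget_s = BUDGET_S  # static number templated into the prompt

print(f"true remaining:  {wall_clock_deadline - time.monotonic():5.1f}s")
print(f"attempt's view:  {attempt_deadline - time.monotonic():5.1f}s")
print(f"model's belief:  {models_believed_budget_s:5.1f}s")
```

Only the first number is true. The other two diverge from it by exactly the time the retry layer consumed without telling anyone.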
What gRPC figured out
gRPC solved this a decade ago and the design is worth borrowing wholesale. A deadline is a wall-clock instant, not a duration. When a server calls another server on behalf of a request, it does not pass along a fresh thirty-second budget. It computes deadline - now() and passes that. The downstream service inherits the original deadline minus the time already spent upstream, including any retries.
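In Python's grpcio this propagation is nearly free, because the server context already carries the decayed budget. A sketch, assuming generated stubs lookup_pb2_grpc for a hypothetical downstream Lookup service; context.time_remaining() and the timeout keyword are the real gRPC API:

```python
import grpc
import lookup_pb2_grpc  # assumed: stubs generated for a downstream Lookup service

def handle(request, context: grpc.ServicerContext):
    # Seconds left on the ORIGINAL request's deadline, already net of all
    # time spent upstream, including retries. None if no deadline was set.
    remaining_s = context.time_remaining()

    with grpc.insecure_channel("downstream:50051") as channel:
        stub = lookup_pb2_grpc.LookupStub(channel)
        # Pass deadline - now(), never a fresh budget: the downstream call
        # inherits whatever remains of the caller's deadline.
        return stub.Lookup(request, timeout=remaining_s)
```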
A few properties fall out of this design that are worth keeping in mind:
- Time is monotonic and global to the request. There is exactly one source of truth — the original deadline — and every subsystem reasons in terms of "how much of that is left."
- Retries cost time, not chances. A retry that takes eight seconds consumed eight seconds of the request's budget, and the next caller in the chain sees a budget eight seconds smaller. There is no "retry counter" that resets on each hop.
- A subsystem can decline to start work if its expected duration exceeds the remaining budget. Better to fail fast with a clear DEADLINE_EXCEEDED than to begin a six-second operation when only two seconds remain.
Agent harnesses need each of these. They mostly have none of them.
How to stop lying to the model
The fix has four parts, and skipping any of them leaves the failure mode intact.
Anchor the deadline to the request, not the attempt. When the harness accepts a request, it stamps request_received_at and deadline = request_received_at + budget. Every retry, every tool call, every model invocation is judged against deadline, not against a duration that resets per attempt. The retry layer is responsible for not lying about how much time has elapsed across its own retries. If the user's wall-clock budget is thirty seconds and the harness has already spent eighteen on a failed-then-retried attempt, the next planner step sees twelve, not thirty.
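A minimal sketch of that contract in Python, with hypothetical names (RequestClock, TransientError): the deadline is stamped once at acceptance, and the retry loop charges attempts and backoff sleeps against it, not against a per-attempt counter.

```python
import time

class TransientError(Exception):
    """Stand-in for whatever your tools raise on retryable failures."""

class DeadlineExceeded(Exception):
    """Raised when the request's wall-clock budget is exhausted."""

class RequestClock:
    """Stamped once at request acceptance; never reset by retries."""

    def __init__(self, budget_s: float):
        self.deadline = time.monotonic() + budget_s  # anchored to the request

    def remaining(self) -> float:
        return self.deadline - time.monotonic()

def call_with_retries(tool, clock: RequestClock, max_attempts: int = 3):
    backoff_s, last_err = 0.5, None
    for _ in range(max_attempts):
        if clock.remaining() <= 0:
            # Retries cost time, not chances: stop the moment the budget is
            # gone, however many attempts are nominally left.
            raise DeadlineExceeded("request budget exhausted during retries")
        try:
            return tool()
        except TransientError as err:
            last_err = err
            # Even the backoff sleep is charged against the request's budget.
            time.sleep(min(backoff_s, max(clock.remaining(), 0.0)))
            backoff_s *= 2
    raise last_err
```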
Surface time_remaining to the model at every decision point. The planner cannot make budget-aware choices if its only signal is the static budget it was given at the start. It needs a field — call it whatever, but make it real and dynamic — that says "as of this turn, you have N seconds." This field is computed at the moment the planner is invoked, not when the request arrived. If the model is choosing between a deep-research path and a quick lookup, it should be choosing against time_remaining = deadline - now(), not against the original budget. The cheapest way to teach the model to reason about time is to tell it the truth.
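Concretely, that means recomputing the number at every planner turn, not templating it in once. A sketch reusing the RequestClock above; the prompt shape and the llm client are stand-ins for whatever your stack uses:

```python
def plan_next_step(task: str, clock: RequestClock, llm) -> str:
    remaining_s = clock.remaining()  # computed NOW, not at request arrival
    prompt = (
        f"Task: {task}\n"
        f"time_remaining: {remaining_s:.1f} seconds as of this turn.\n"
        "Pick a plan that completes inside the remaining budget."
    )
    return llm.complete(prompt)  # hypothetical model client
```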
Refuse work that won't fit. Tools and plans should declare an expected duration — a rough p95 is sufficient. The harness checks projected duration against time_remaining before dispatching. If the projection exceeds the budget, the request short-circuits to a fallback path or a graceful failure, not a hopeful start that will breach the SLA on completion. This is the agent equivalent of gRPC's deadline check at server entry: the smart move when there is no time left is to refuse the work, not to begin it and finish three seconds late.
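A sketch of the admission check, again with hypothetical names; the p95 table would come from your own tool telemetry:

```python
# Rough p95 durations per tool, from telemetry (illustrative numbers).
TOOL_P95_S = {"deep_research": 12.0, "quick_lookup": 3.0}

def dispatch(tool_name: str, clock: RequestClock):
    projected_s = TOOL_P95_S[tool_name]
    if projected_s > clock.remaining():
        # The gRPC-style check at entry: refuse work that cannot finish in
        # time instead of starting it and breaching the SLA on completion.
        raise DeadlineExceeded(
            f"{tool_name} needs ~{projected_s:.0f}s, "
            f"only {clock.remaining():.1f}s remain"
        )
    return run_tool(tool_name)  # hypothetical executor
```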
Distinguish two latency signals in your observability. "How long the agent thought it was running" and "how long the user actually waited" are different numbers and you need both. The first one tells you whether the agent's plan was reasonable. The second one tells you whether the user got what was promised. If they diverge, you have deadline drift, and the gap is exactly the time the harness spent in retries the planner doesn't know about. A dashboard that only shows one of these numbers will hide the bug indefinitely.
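The instrumentation is a few lines; the metric names and the metrics client here are illustrative:

```python
import time

def finish_request(request_received_at: float, agent_active_s: float, metrics):
    # What the user actually waited, anchored to request acceptance.
    user_waited_s = time.monotonic() - request_received_at
    metrics.histogram("latency.user_perceived_s", user_waited_s)
    # What the agent thinks it spent: model turns plus the tool time it saw.
    metrics.histogram("latency.agent_internal_s", agent_active_s)
    # The gap is exactly the time burned in retries the planner never saw.
    metrics.histogram("latency.deadline_drift_s", user_waited_s - agent_active_s)
```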
The bimodal SLA pattern
The reason this slips past most teams is that the symptom looks like noise. Median latency is fine. p95 looks fine. Nothing in the aggregate metric screams "deadline bug." Then a customer complains about an unreasonable wait time, the team digs into a single trace, and finds a request that ran for forty seconds while every internal counter said it ran for twenty-eight.
Look for the pattern, not the magnitude. The fingerprint of deadline drift is a bimodal distribution in user-perceived latency that does not exist in agent-internal latency. Slice your latency by retry count, as in the sketch below. Requests with zero retries should cluster around a baseline; requests with one retry should form a distinct, shifted-right cluster; requests with two retries should sit further right still. If those distributions overlap completely from the agent's point of view but spread out from the user's point of view, the harness is hiding retries from the planner and you are sitting on the bug.
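A sketch of that slice, assuming a trace export with one row per request and columns retry_count, agent_latency_s, and user_latency_s:

```python
import pandas as pd

df = pd.read_parquet("traces.parquet")  # one row per request
p95 = (
    df.groupby("retry_count")[["agent_latency_s", "user_latency_s"]]
    .quantile(0.95)
)
print(p95)
# Drift fingerprint: user_latency_s p95 climbs with retry_count while
# agent_latency_s p95 stays flat.
```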
Another signature: SLA breaches that correlate with tool flakiness rather than with task complexity. If the slow requests are the ones that hit a transient 503 from a third-party tool, not the ones that asked for genuinely hard work, the latency is being burned in invisible retries. The planner had no way to know it was already over budget when it greenlit the deep-research path.
Punctuality decays as reliability improves
Here is the part that should make engineering leads uncomfortable. The classic instinct, when retries are working, is to make them more aggressive. More attempts. Longer backoff. Wider error classification. Every one of those changes makes the retry layer more reliable and the deadline drift worse — because every retry that succeeds is a chunk of the user's budget burned without the planner's knowledge.
A team that invests in making its tool calls more durable, without simultaneously teaching its planner about wall-clock budgets, ends up shipping a system whose punctuality decays as its reliability improves. The 99th-percentile success rate creeps up, the user-visible 99th-percentile latency creeps up too, and the team can't connect the two trends because they look like wins on different dashboards.
The fix is not to stop retrying. The fix is to stop letting retries lie to the model.
What "deadline" actually means
The realization that has to land is that "deadline" is a property of the user-perceived task, not of any individual inference attempt. The user does not care whether the agent's third retry succeeded — they care whether their request completed inside the time the product promised. Every layer of the stack that resets the clock at its own boundary is an opportunity for the system to feel slower than its components.
Treat the deadline like a global, decaying budget owned by the request. Pass it down with every call. Refresh the model's view of it at every decision point. Refuse work that won't fit. Measure the gap between what the agent thinks happened and what the user experienced, and treat that gap as a first-class metric.
A reliable agent that is uniformly two seconds late on every retried request is not a reliable agent. It is a reliable agent with a deadline accounting bug, and the fix is in the seam between the retry layer and the planner — the place where you stopped telling each side what the other one was doing.
Sources
- https://grpc.io/docs/guides/deadlines/
- https://grpc.io/blog/deadlines/
- https://learn.microsoft.com/en-us/aspnet/core/grpc/deadlines-cancellation
- https://userver.tech/d6/d64/md_en_2userver_2deadline__propagation.html
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://harness-engineering.ai/blog/what-is-harness-engineering/
- https://www.hugo.im/posts/100th-tool-call-problem
- https://latitude.so/blog/ai-agent-failure-detection-guide
- https://dasroot.net/posts/2026/02/handling-timeouts-retries-llm-systems/
- https://relayplane.com/blog/agent-runaway-costs-2026
