
The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

The dispatch-time fallacy

The default mental model for an async agent task is request-reply with a longer wait. The user asks for something, the agent works on it, and at some point the answer comes back. Frameworks lean into this framing — continuation tokens, polling endpoints, completion webhooks, push notifications when the job is done. The plumbing is solid.

What the plumbing does not encode is that the input to the task is not just the prompt at dispatch. The input is the entire trajectory of session state from dispatch to delivery. If the user said "find the cheapest flight to Tokyo for next Tuesday" and then ninety seconds later they have already navigated to their calendar, deleted next Tuesday's meeting, and rebooked the trip for the following month, the original answer is not just stale — it is wrong against a state of the world the agent never saw.

The dispatch-time fallacy assumes that whatever was true when the user asked is still the question the agent is answering. In a synchronous chat turn, that assumption holds because the user is staring at the loading spinner. In a background task, that assumption is the bug.

Why ninety seconds is the dangerous regime

Sub-second responses live in the synchronous world; the user is locked to the conversation. Multi-day tasks live in the explicit-handoff world; the user has filed the ask and gone home, and the result arriving in tomorrow's inbox is exactly what they expect.

The dangerous regime is the middle band — long enough that the user does not wait, short enough that they have not mentally filed the task as out-of-band. Thirty seconds to five minutes is roughly where the empty-room problem peaks, because the system is acting on the assumption that the user is still on the same page they were on at dispatch, and the user is acting on the assumption that they are free to do something else.

This is also the band where agent frameworks are pushing hardest. Deep research agents that run for tens of minutes are large enough that everyone designs an inbox for them. Sub-second tool calls are small enough that nobody bothers with async at all. The middle is where the lazy default — "we'll just send the result back to wherever the user was" — produces the most damage.

A relevance gate at delivery time

The architectural shift is to insert a check at the moment of delivery, not just the moment of dispatch. The check answers one question: is the result still applicable to the current session state?

A workable shape for that gate has three parts.

First, snapshot the relevant session state at task-start and ship it with the task. This is not the full transcript. It is whatever the agent's reasoning depended on — the active document, the conversation thread id, the page or tool context, the parameters that made the prompt make sense. Hash it.

Second, at delivery time, recompute the same hash against the current session state and compare. Equal hashes mean the user has not moved; the result can apply directly. Different hashes mean the user has moved, and the system needs to reason about whether the difference matters.

Third, classify the result against that delta into one of three buckets. Still-applicable when the change is cosmetic — the user navigated within the same context, scrolled, or made an edit that does not invalidate the answer. Advisory-only when the change is real but the result is still useful as information rather than an action — surface it as a notification the user can choose to apply, not a state mutation. Discard when the change has invalidated the underlying question — log the result, tell the user the original task was overtaken by events, do not silently apply anything.

The default for ambiguous cases should be advisory-only, not still-applicable. When in doubt, surface a notification, do not mutate.
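The three parts above can be sketched in a few lines. This is a minimal illustration, not a framework API: the `matters` callback stands in for the per-task judgment of whether a delta invalidates the question, and every name here is hypothetical.

```python
import hashlib
import json
from enum import Enum

class Delivery(Enum):
    APPLY = "apply"        # still-applicable: mutate state in place
    ADVISORY = "advisory"  # surface as a notification, do not mutate
    DISCARD = "discard"    # log it; tell the user the task was overtaken

def snapshot_hash(state: dict) -> str:
    """Hash only the fields the agent's reasoning depended on."""
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def gate(dispatch_state: dict, current_state: dict, matters) -> Delivery:
    """Delivery-time relevance gate.

    `matters(dispatch_state, current_state)` returns True when the delta
    invalidates the original question, False when it is informative but
    not invalidating, and None when ambiguous.
    """
    if snapshot_hash(dispatch_state) == snapshot_hash(current_state):
        return Delivery.APPLY  # the user has not moved
    if matters(dispatch_state, current_state) is True:
        return Delivery.DISCARD
    # Real-but-informative and ambiguous deltas both default to
    # advisory, per the when-in-doubt rule: notify, do not mutate.
    return Delivery.ADVISORY
```

Note that equal hashes short-circuit the classifier entirely; the per-task reasoning only runs when the state has actually diverged.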

What the snapshot needs to capture

The temptation is to over-snapshot — store the full session, the full DOM state, every preference. This produces a hash that is too sensitive: every cosmetic change invalidates the result, the discard rate goes through the roof, and engineers turn the gate off because it is too noisy.

The other extreme — snapshotting just the prompt text — is what most systems do today. It is too insensitive: the system cannot tell that the user has switched to a different document, started a new thread, or undone the action that motivated the task in the first place.

The right granularity is the causal envelope of the prompt: the smallest set of state that, if held constant, makes the original prompt mean what it meant. For a "summarize this document" task, that's the document id and version. For a "find the cheapest flight" task, that's the trip parameters plus the calendar window. For a coding agent's "fix this test" task, it's the file paths, the test name, and the working tree hash. The discipline is to ask, for each task type, what would have to change about the world for the answer to no longer be the answer the user wanted.

This is not a generic property of the framework. It is a per-task contract that the agent author owns, the way an HTTP endpoint owns its idempotency key.
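One way to encode that per-task contract is a projection from the full session down to the causal envelope. The task names and field names below are illustrative, taken from the examples above, not a real registry:

```python
# Hypothetical per-task snapshot contracts: each task type declares the
# smallest set of state that, held constant, keeps the prompt's meaning
# fixed. Everything else in the session is cosmetic for that task.
CAUSAL_ENVELOPE = {
    "summarize_document": lambda s: {
        "doc_id": s["doc_id"],
        "doc_version": s["doc_version"],
    },
    "find_flight": lambda s: {
        "trip_params": s["trip_params"],
        "calendar_window": s["calendar_window"],
    },
    "fix_test": lambda s: {
        "file_paths": s["file_paths"],
        "test_name": s["test_name"],
        "tree_hash": s["tree_hash"],
    },
}

def envelope(task_type: str, session: dict) -> dict:
    """Project the full session state down to the task's causal envelope."""
    return CAUSAL_ENVELOPE[task_type](session)
```

The payoff is that cosmetic session changes (scroll position, theme, panel layout) never enter the hash, so they cannot trip the gate.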

Notification, not auto-application

When the gate flags a result as advisory-only or discarded, the UX matters as much as the architecture. The default in most assistants is to apply results in-place — the email is drafted, the document is edited, the calendar event is created. Apply-in-place is fine when the user is watching. It is hostile when the user has moved on, because the appearance of state they did not direct erodes their model of what the system is doing on their behalf.

The pattern that works is to distinguish two delivery modes in the UI itself. "Your earlier ask is ready" is a notification, scoped to the original task, that the user opens deliberately when they are willing to switch contexts. "I'm continuing what we were doing" is in-line state mutation, only valid when the gate confirmed the user is still in the same context.

The two modes look different. One has a badge, an inbox, an explicit accept-or-discard step. The other has no chrome at all because the user is already in the loop. Conflating them is what produces the surprise — applied state with no apparent cause — that destroys trust.
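The routing between the two modes can be made explicit in code. A rough sketch, assuming a hypothetical `session` object with an `apply` method for in-line mutation and an `inbox` list backing the notification surface:

```python
def route_delivery(result: dict, same_context: bool,
                   invalidated: bool, session) -> str:
    """Choose the delivery mode the UI renders for a finished task."""
    if invalidated:
        # Overtaken by events: log it where the user can see why,
        # never silently apply.
        session.inbox.append({"kind": "overtaken", "result": result})
        return "discarded"
    if same_context:
        # "I'm continuing what we were doing": no chrome, the user
        # is already in the loop.
        session.apply(result)
        return "applied"
    # "Your earlier ask is ready": badge, inbox, explicit
    # accept-or-discard step.
    session.inbox.append({"kind": "ready", "result": result})
    return "notified"
```

The point of returning a mode string is that the UI layer renders each mode with different chrome, rather than one generic "result arrived" path.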

What to evaluate

Eval suites for async agents tend to score the agent on the quality of the answer it produces, sliced by task type and difficulty. That misses the empty-room failure entirely, because the answer is fine; the delivery context is broken.

The eval that catches this needs adversarial scenarios where the user's session state changes mid-task. Construct test cases that dispatch a task, mutate the session deterministically — switch documents, start a new thread, reverse the original action — and assert what the system does next. The pass criterion is not that the answer is correct. It is that the system either correctly applies, correctly demotes to advisory, or correctly discards, given the delta.

A few specific scenarios worth including in the suite:

  • Context-switch mid-flight: dispatch the task, navigate to a different document, wait for delivery. Assert the result is delivered as a notification, not auto-applied.
  • Reversal mid-flight: dispatch the task, undo the action that motivated it. Assert the result is discarded with a user-facing log entry, not silently applied to a now-irrelevant state.
  • No-op mutation: dispatch the task, scroll the document, wait for delivery. Assert the result is delivered in-place; the gate did not over-trigger on a cosmetic change.
  • Re-asked variant: dispatch the task, start a related but different task before the first finishes. Assert that both results are routed correctly and neither overwrites the other.
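The first scenario, context-switch mid-flight, can be written against an in-memory stand-in for the harness. Everything here is illustrative: `FakeHarness` and its methods are not any framework's API, just the minimal shape the test needs.

```python
class FakeHarness:
    """Minimal in-memory stand-in for an async-agent test harness."""
    def __init__(self):
        self.session = {"active_doc": "d1"}
        self.dispatch_snapshot = None

    def dispatch(self, prompt: str) -> str:
        # Snapshot the session state the task's reasoning depends on.
        self.dispatch_snapshot = dict(self.session)
        return prompt  # the prompt doubles as a task handle here

    def mutate_session(self, **changes):
        self.session.update(changes)

    def await_delivery(self, task: str) -> dict:
        # The gate: equal state applies in place, diverged state demotes
        # the result to a notification.
        same = self.session == self.dispatch_snapshot
        return {"mode": "in_place" if same else "notification"}

def test_context_switch_mid_flight():
    h = FakeHarness()
    task = h.dispatch("summarize this document")
    h.mutate_session(active_doc="d2")  # user navigates away mid-task
    delivery = h.await_delivery(task)
    assert delivery["mode"] == "notification"  # not auto-applied
```

The reversal, no-op, and re-asked variants follow the same pattern: deterministic session mutation between dispatch and delivery, then an assertion on the delivery mode rather than the answer.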

The aggregate metric is not accuracy. It is the rate of stale-application errors per thousand background completions, broken out by delta type. A system that drops to zero application errors by silently discarding everything is not winning; the eval needs to measure both false-applies and false-discards.

The invariant to internalize

The real architectural realization is that an async agent task is not a function call. A function call's input is fixed at the moment of invocation, and the only output is whatever the function returns. An async agent task's input is the trajectory of session state from start to delivery, and the system's behavior at delivery is part of the contract, not an afterthought.

Once you accept that, the design pressure shifts. Continuation tokens are necessary but not sufficient. State persistence is necessary but not sufficient. What is sufficient is a delivery-time relevance check, a per-task definition of what counts as a meaningful state change, and a UX that treats notification and in-line application as different first-class modes.

The teams that get this right will be the ones that ran an eval against the empty-room scenario before a user noticed it in production. The teams that get this wrong will spend a quarter trying to figure out why their async feature has a higher complaint rate than the synchronous one it replaced — and the answer will be that the async path is doing exactly what the sync path does, ninety seconds too late, against state the user has already left behind.
