Skip to main content

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

The dispatch-time fallacy

The default mental model for an async agent task is request-reply with a longer wait. The user asks for something, the agent works on it, and at some point the answer comes back. Frameworks lean into this framing — continuation tokens, polling endpoints, completion webhooks, push notifications when the job is done. The plumbing is solid.

What the plumbing does not encode is that the input to the task is not just the prompt at dispatch. The input is the entire trajectory of session state from dispatch to delivery. If the user said "find the cheapest flight to Tokyo for next Tuesday" and then ninety seconds later they have already navigated to their calendar, deleted next Tuesday's meeting, and rebooked the trip for the following month, the original answer is not just stale — it is wrong against a state of the world the agent never saw.

The dispatch-time fallacy assumes that whatever was true when the user asked is still the question the agent is answering. In a synchronous chat turn, that assumption holds because the user is staring at the loading spinner. In a background task, that assumption is the bug.

Why ninety seconds is the dangerous regime

Sub-second responses live in the synchronous world; the user is locked to the conversation. Multi-day tasks live in the explicit-handoff world; the user has filed the ask and gone home, and the result arriving in tomorrow's inbox is exactly what they expect.

The dangerous regime is the middle band — long enough that the user does not wait, short enough that they have not mentally filed the task as out-of-band. Thirty seconds to five minutes is roughly where the empty-room problem peaks, because the system is acting on the assumption that the user is still on the same page they were on at dispatch, and the user is acting on the assumption that they are free to do something else.

This is also the band where agent frameworks are pushing hardest. Deep research agents that run for tens of minutes are large enough that everyone designs an inbox for them. Sub-second tool calls are small enough that nobody bothers with async at all. The middle is where the lazy default — "we'll just send the result back to wherever the user was" — produces the most damage.

A relevance gate at delivery time

The architectural shift is to insert a check at the moment of delivery, not just the moment of dispatch. The check answers one question: is the result still applicable to the current session state?

A workable shape for that gate has three parts.

First, snapshot the relevant session state at task-start and ship it with the task. This is not the full transcript. It is whatever the agent's reasoning depended on — the active document, the conversation thread id, the page or tool context, the parameters that made the prompt make sense. Hash it.

Second, at delivery time, recompute the same hash against the current session state and compare. Equal hashes mean the user has not moved; the result can apply directly. Different hashes mean the user has moved, and the system needs to reason about whether the difference matters.

Third, classify the result against that delta into one of three buckets. Still-applicable when the change is cosmetic — the user navigated within the same context, scrolled, or made an edit that does not invalidate the answer. Advisory-only when the change is real but the result is still useful as information rather than an action — surface it as a notification the user can choose to apply, not a state mutation. Discard when the change has invalidated the underlying question — log the result, tell the user the original task was overtaken by events, do not silently apply anything.

The default for ambiguous cases should be advisory-only, not still-applicable. When in doubt, surface a notification, do not mutate.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates