Non-Blocking AI: Async UX Patterns That Keep Applications Responsive While Agents Work
Most teams discover the synchronous UI problem the same way: a user clicks "Generate report" and the browser tab goes silent for forty seconds. No spinner, no progress, just a frozen button. Half the users hit refresh and submit twice. The other half assume the product is broken and close the tab.
The root issue is not agent latency — it's that LLM-backed agents operate on timescales that break every assumption baked into synchronous request-response UX. A single GPT-4o call averages 8–15 seconds. A multi-step agent that searches the web, reads three documents, writes a draft, then formats the output can take two to four minutes. You cannot make that feel fast by optimizing the agent. You have to redesign the contract between your backend and your UI.
This article covers the full stack of patterns that make long-running agent work feel acceptable to users — from token streaming on the frontend to job queues on the backend, with the cancellation semantics and race condition fixes in between.
The Latency Stack Is Not Going Away
Before diving into patterns, it helps to understand why the latency problem is structural. A naive agent chain compounds delays at every layer:
- LLM inference: 3–15 seconds per call at P50, often 30+ seconds at P95 under load
- Tool calls: Each web search, database query, or API call adds 1–5 seconds
- Multi-step reasoning: A chain of five LLM calls with tool use between them routinely hits two minutes
- Structured output overhead: Constraining the model to JSON schemas breaks naive token streaming and often adds another round-trip
The traditional fix — faster infrastructure — helps at the margins but doesn't change the order of magnitude. A 3-second response time is an abandonment trigger; sub-2-second P95 is the defensible bar for chat UX. No model speed improvement gets a complex agent chain under that threshold. The fix is to change what the user experiences while waiting, not to eliminate the wait.
Streaming First: Make Every Token Visible
The single highest-leverage change for agent UX is streaming output at the token level. Instead of waiting for the model to finish generating and then flushing the response, stream tokens to the client as they're produced. Users typically see the first token within 2–3 seconds even when total completion takes 60+ seconds — and a response that's visibly growing feels very different from a blank screen.
Server-Sent Events (SSE) is the right transport for most setups. It's unidirectional server-to-client, works over standard HTTP without upgrade handshakes, handles reconnection automatically, and integrates natively with the Fetch API. WebSockets make sense when you need bidirectional real-time communication (user is editing a document while the agent is also writing to it), but for the common case of "agent produces output, user reads it," SSE is simpler and sufficient.
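To make that concrete, here is a minimal client-side sketch that reads an SSE token stream over fetch. The /api/chat endpoint, its request body, and the [DONE] sentinel are assumptions for illustration, not a specific framework's contract:

```ts
// Minimal SSE token consumer over fetch + ReadableStream.
// Assumes a hypothetical POST /api/chat endpoint that responds with
// Content-Type: text/event-stream and emits one token per "data:" line.
async function streamChat(
  prompt: string,
  onToken: (token: string) => void,
  signal?: AbortSignal,
): Promise<void> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`stream failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line; only parse complete events.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const data = line.slice(5).trimStart();
        if (data === "[DONE]") return; // end-of-stream sentinel, if your server sends one
        onToken(data); // append to UI state as each token arrives
      }
    }
  }
}
```

Each call to onToken can append to component state, so the response grows on screen token by token instead of appearing all at once.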
The harder problem is structured output. If you're asking the model to return JSON — a table of results, a list of action items, a set of code changes — vanilla streaming breaks because you can't render partial JSON. The practical fix is to stream the response as raw text, use an incremental JSON parser (libraries exist for JavaScript, Python, and Go), and render completed fields as they become parseable. The Vercel AI SDK's createDataStreamResponse takes this further, letting you interleave structured metadata (retrieved documents, tool results) into the stream before the LLM even begins its response.
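If you would rather not add a dependency, a hand-rolled best-effort parser captures the idea: close whatever the buffer has left open, attempt a parse, and render any fields that come back. This is a simplification of what the incremental-parsing libraries do, shown here as a sketch:

```ts
// Best-effort parse of a partial JSON buffer: close any open strings,
// arrays, and objects, then try JSON.parse. Returns undefined until the
// buffer is complete enough to yield valid JSON.
function tryParsePartialJson(buffer: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (const ch of buffer) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  let candidate = buffer;
  if (inString) candidate += '"';                // terminate a half-received string
  candidate = candidate.replace(/[,:\s]+$/, ""); // drop a dangling comma or colon
  candidate += closers.reverse().join("");       // close open arrays and objects

  try {
    return JSON.parse(candidate);
  } catch {
    return undefined; // not parseable yet; wait for more tokens
  }
}
```

Call it on every chunk: it returns undefined until enough of the payload has arrived, then progressively richer objects as more fields close.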
Optimistic Updates: Show Expected State Immediately
Streaming handles the output generation phase. Optimistic UI handles the period between the user's action and the first visible output — which for complex agents can still be several seconds.
The pattern is straightforward: when the user submits a request, immediately update the UI to reflect the expected outcome. Show the message in the conversation thread, mark the task as in-progress in the task list, add the new row to the table. If the backend operation succeeds (the normal case), the UI was already correct. If it fails, roll back to the previous state and surface an error.
React's useOptimistic hook formalizes this pattern at the framework level. You provide the current state and an update function; React applies your optimistic delta immediately and reverts automatically if the async action throws. Products like Linear use this extensively — creating an issue locally in a few milliseconds while the database write happens in the background.
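A minimal sketch of the pattern with React 19's useOptimistic; the Message shape and sendMessage call are placeholders standing in for a real API client:

```tsx
import { useOptimistic, useState, useTransition } from "react";

type Message = { id: string; text: string; pending?: boolean };

// Stand-in for the real backend write; may take several seconds.
declare function sendMessage(text: string): Promise<Message>;

function Thread() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [, startTransition] = useTransition();

  // Optimistic view: the confirmed list plus a temporary pending entry.
  const [optimisticMessages, addOptimisticMessage] = useOptimistic(
    messages,
    (current: Message[], text: string) => [
      ...current,
      { id: `optimistic-${current.length}`, text, pending: true },
    ],
  );

  function handleSend(text: string) {
    startTransition(async () => {
      addOptimisticMessage(text);               // appears in the thread immediately
      const saved = await sendMessage(text);    // the actual write
      setMessages((prev) => [...prev, saved]);  // confirmed state replaces the optimistic entry
      // If sendMessage throws, the optimistic entry disappears when the
      // transition settles; surface the error to the user separately.
    });
  }

  return (
    <>
      <ul>
        {optimisticMessages.map((m) => (
          <li key={m.id} style={{ opacity: m.pending ? 0.5 : 1 }}>
            {m.text}
          </li>
        ))}
      </ul>
      <button onClick={() => handleSend("Draft a status update")}>Send</button>
    </>
  );
}
```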
The key design constraint is that optimistic updates work best for operations where failure is rare and rollback is cheap. For agent tasks that produce irreversible side effects (emails sent, files committed, payments processed), you don't want to optimistically show "email sent" before it actually sends. Use optimistic state for the intermediate "working" phase, but hold the success state until you have confirmation.
Skeleton States That Communicate Progress Semantically
A spinner says "something is happening." A skeleton state says "content of approximately this shape is loading here." A semantic progress indicator says "the agent has finished searching and is now writing." Each level gives users more information and more patience.
The research on skeleton screens is consistent: showing users where content will appear, and roughly what shape it will take, reduces perceived wait time significantly. For agent workflows specifically, semantic progress beats generic loading because it answers the question users actually have: "Has it started? Is it stuck?"
The pattern used by products like Perplexity and Cursor is to emit discrete progress events from the agent as it works, and render each event in the UI:
- "Searching for recent papers on X..."
- "Reading 3 documents..."
- "Drafting response..."
- "Formatting output..."
These aren't just cosmetic. They tell users the system is making progress toward their goal, not spinning on an error it hasn't surfaced yet. Windsurf's "Flows" concept takes this further — each agent step is an explicit UI construct that can be paused, inspected, and resumed independently.
The implementation requires your agent to emit structured events during execution, not just at completion. LangGraph does this natively via streaming mode; custom agents need an event bus or progress channel alongside the result channel.
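For a custom agent, one workable structure (a sketch, not LangGraph's API) is to write the agent as an async generator that yields typed progress events and then the final result, and forward each event over the same SSE channel the tokens use. The tool helpers here are stand-ins:

```ts
// Progress events the UI can render as discrete steps.
type AgentEvent =
  | { type: "progress"; step: string; detail?: string }
  | { type: "result"; output: string };

// Hypothetical tool/LLM helpers; replace with your real implementations.
declare function searchWeb(query: string): Promise<string[]>;
declare function readDocuments(urls: string[]): Promise<string[]>;
declare function draftAnswer(context: string[]): Promise<string>;

// The agent yields a progress event *before* each step, so the UI always
// knows what it is currently doing, then yields the result at the end.
async function* runAgent(query: string): AsyncGenerator<AgentEvent> {
  yield { type: "progress", step: "searching", detail: query };
  const urls = await searchWeb(query);

  yield { type: "progress", step: "reading", detail: `${urls.length} documents` };
  const docs = await readDocuments(urls);

  yield { type: "progress", step: "drafting" };
  const answer = await draftAnswer(docs);

  yield { type: "result", output: answer };
}

// Server side: forward each event as one SSE message.
// `write` stands in for whatever response-writing primitive your framework exposes.
async function streamAgent(query: string, write: (chunk: string) => void) {
  for await (const event of runAgent(query)) {
    write(`data: ${JSON.stringify(event)}\n\n`);
  }
}
```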
Cancellation Without Leaving Wreckage
The most overlooked async UX problem is cancellation. Users cancel long-running tasks for legitimate reasons — they changed their mind, entered the wrong parameters, or just don't want to wait anymore. The question is what happens to work the agent has already done.
This splits into two categories:
Idempotent side effects (read-only database queries, web searches, analysis): Cancel freely. The work is discarded; nothing in the world changed.
Non-idempotent side effects (emails sent, API calls made, files written, payments charged): Cancel is ambiguous. The agent may have already performed the action before receiving the cancellation signal. You need idempotency keys on every state-changing operation so that if the agent retried after a dropped connection, you don't duplicate the effect. And you need the cancel flow to distinguish "we stopped before acting" from "we acted; here's what happened."
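A sketch of the idempotency-key half, with the store and email client as placeholders: derive a stable key from the task plus the specific action, claim it atomically before acting, and refuse to act twice for the same key.

```ts
// Assumed key-value store with atomic "set if not exists"; Redis SET NX,
// a unique-constrained database column, or similar would all work.
interface IdempotencyStore {
  // Returns true if the key was newly claimed, false if it already existed.
  claim(key: string, ttlSeconds: number): Promise<boolean>;
}

declare const store: IdempotencyStore;
declare function sendEmail(to: string, body: string): Promise<void>;

async function sendEmailOnce(taskId: string, to: string, body: string) {
  // Stable key: same task + same action => same key, even across retries.
  const key = `email:${taskId}:${to}`;

  const claimed = await store.claim(key, 60 * 60 * 24);
  if (!claimed) {
    // A previous attempt (possibly one the user tried to cancel) already
    // claimed this action; report that rather than sending a duplicate.
    return { status: "already-performed" as const };
  }

  await sendEmail(to, body);
  return { status: "performed" as const };
}
```

A production version would also record whether the action completed after the claim (a claim without a completion needs reconciliation), but the stable key is what prevents a retry or an ambiguous cancel from duplicating the side effect.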
On the frontend, AbortController is the correct primitive. Each new invocation creates a new controller; submitting a new request aborts the previous one before the new fetch starts. This prevents the classic race condition where the user types a second query before the first completes, and the first response arrives last, overwriting the correct UI state.
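The wiring is small; here is a sketch of the controller lifecycle, reusing the streamChat helper from the streaming sketch earlier:

```ts
let currentController: AbortController | null = null;

async function submitQuery(prompt: string, onToken: (token: string) => void) {
  // Abort whatever is still in flight before starting a new request, so a
  // stale response can never arrive after (and overwrite) a newer one.
  currentController?.abort();
  const controller = new AbortController();
  currentController = controller;

  try {
    await streamChat(prompt, onToken, controller.signal);
  } catch (err) {
    if (controller.signal.aborted) return; // cancelled or superseded: not an error
    throw err; // real failure: surface it
  }
}

// A user-facing Cancel button aborts the same controller.
function cancelCurrent() {
  currentController?.abort();
}
```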
For longer agent tasks with meaningful in-flight work, "pause" is often more user-appropriate than "cancel." Several AI code editors now expose pause/resume on agentic sequences explicitly — this lets users review intermediate output, redirect the agent, and continue rather than discard everything and start over.
Backend Architecture: Job Queues Over HTTP Long Polling
For agent tasks that take more than 10–15 seconds, the backend architecture needs to shift from synchronous HTTP to async job queues with status callbacks.
The pattern:
- Client submits task → server enqueues job, returns job ID immediately (HTTP 202 Accepted)
- Client polls job status endpoint or subscribes to a WebSocket/SSE channel for updates
- Worker processes the job, emitting progress events
- On completion, client is notified and fetches the result
BullMQ on Node.js is the practical choice for this pattern — Redis-backed, TypeScript-native, handles retries with exponential backoff, supports job priority and DAG dependencies (a parent job that can only complete after multiple child jobs finish). For workflows that need durability across restarts and explicit human approval steps, Temporal is the more structured option.
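A sketch of that contract with BullMQ and Express; the queue name, Redis connection, and runAgentTask helper are assumptions for illustration:

```ts
import { Queue, Worker } from "bullmq";
import express from "express";

const connection = { host: "localhost", port: 6379 }; // assumed Redis instance
const agentQueue = new Queue("agent-tasks", { connection });

// Stand-in for your actual agent; reports each step as it starts.
declare function runAgentTask(
  input: string,
  onProgress: (step: string) => Promise<void>,
): Promise<string>;

// Worker: processes jobs and reports progress so clients can render steps.
new Worker(
  "agent-tasks",
  async (job) => {
    const result = await runAgentTask(job.data.input, (step) =>
      job.updateProgress({ step }),
    );
    return result; // stored as the job's return value
  },
  { connection },
);

const app = express();
app.use(express.json());

// 1. Submit: enqueue and return a job ID immediately (HTTP 202 Accepted).
app.post("/tasks", async (req, res) => {
  const job = await agentQueue.add("run", { input: req.body.input });
  res.status(202).json({ jobId: job.id });
});

// 2. Poll: clients (including ones reconnecting after a browser refresh)
//    fetch status, progress, and the result by job ID.
app.get("/tasks/:id", async (req, res) => {
  const job = await agentQueue.getJob(req.params.id);
  if (!job) return res.status(404).end();
  res.json({
    state: await job.getState(), // "waiting" | "active" | "completed" | "failed" | ...
    progress: job.progress,      // last value passed to updateProgress
    result: job.returnvalue ?? null,
  });
});

app.listen(3000);
```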
Webhooks serve the same purpose in service-to-service contexts. When an agent completes a task, it fires a webhook to a registered endpoint; the receiving service can then trigger downstream processing without polling. The 2025-2026 pattern is pairing webhooks with CloudEvents as the payload schema and ephemeral signing keys for verification.
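A sketch of the notification side; the endpoint registration, header name, and event type are illustrative rather than a specific product's format:

```ts
import { createHmac } from "node:crypto";

// Fire a completion webhook with an HMAC signature the receiver can verify.
async function notifyCompletion(
  endpoint: string,      // registered by the receiving service
  signingSecret: string, // shared (or ephemeral, rotated) signing key
  jobId: string,
  result: unknown,
) {
  // CloudEvents-style envelope; the field names follow the CloudEvents spec,
  // but the type/source values here are made up for the example.
  const event = {
    specversion: "1.0",
    type: "com.example.agent.task.completed",
    source: "/agents/report-writer",
    id: jobId,
    time: new Date().toISOString(),
    data: { jobId, result },
  };

  const body = JSON.stringify(event);
  const signature = createHmac("sha256", signingSecret).update(body).digest("hex");

  await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/cloudevents+json",
      "X-Signature-SHA256": signature, // receiver recomputes and compares
    },
    body,
  });
}
```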
One failure mode to watch: fire-and-forget without acknowledgment. If your backend enqueues a job and the client disconnects, you need the job to continue and the result to be retrievable when the client reconnects. Storing job state in Redis with a TTL, and exposing a result endpoint keyed by job ID, gives you this. Clients can always re-poll even after a browser refresh.
Race Conditions Are More Common Than You Think
When users interact with rapidly-changing UI backed by async agent calls, race conditions are the quiet failure mode that degrades data integrity without surfacing obvious errors.
The classic example: a search input that fires an agent query as the user types. Typing "machine learning deployment" triggers five separate queries along the way. The fifth query, for the full text, happens to return first with the correct result. Then queries one through four trickle in out of order, and each one overwrites the UI state with stale data. Whichever stale prefix query arrives last is what stays on screen; the displayed result matches something like "machi", not the full query.
The fix requires intentional ordering in your frontend code:
- Assign a monotonically increasing sequence number to each request
- On response, only apply the update if the sequence number matches the most recent request
- Use AbortController to cancel in-flight requests when a new one starts (a sketch combining all three guards follows the list)
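Here is that sketch; the abort portion mirrors the cancellation pattern from earlier, and runQuery stands in for whatever performs the actual agent call:

```ts
declare function runQuery(query: string, signal: AbortSignal): Promise<string>;

let latestSeq = 0;
let controller: AbortController | null = null;

async function onQueryChange(query: string, render: (result: string) => void) {
  const seq = ++latestSeq;  // this request's sequence number
  controller?.abort();      // cancel whatever is still in flight
  const myController = new AbortController();
  controller = myController;

  try {
    const result = await runQuery(query, myController.signal);
    // Only the most recent request may touch the UI; anything older is
    // stale even if it somehow completed before being aborted.
    if (seq === latestSeq) render(result);
  } catch (err) {
    if (myController.signal.aborted) return; // superseded by a newer query
    throw err;
  }
}
```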
The deeper variant of this problem occurs when the agent performs write operations and the UI state falls out of sync with backend state. The safest model is to treat the backend as the source of truth and re-fetch on completion, rather than patching local state incrementally. Optimistic updates are appropriate for predicted state; definitive state should come from the server response.
What the Leading Products Converged On
Looking across the products that handle agent UX well in 2025-2026, the convergence is striking:
- Streaming is assumed, not optional
- Side-panel or split-screen layout separates the in-progress work from the conversation history, so users can see agent output accumulating without losing context
- Step-level progress events are surfaced in the UI, not just a generic spinner
- Pause/resume is becoming standard for multi-phase tasks, replacing hard cancel-and-restart
- Idempotency is treated as an infrastructure requirement, not an afterthought
The products that still frustrate users share a common flaw: they treat the agent as a black box that produces a result, rather than a process that can be observed and interrupted. The shift to async, observable agent execution isn't just UX polish — it's what lets users trust that the system is working rather than broken.
Starting Small
If you're integrating agent capabilities into an existing product, the incremental path looks like this:
- Add token streaming to any LLM calls that block the UI. This alone eliminates the "frozen screen" problem.
- Add progress events to any agent chain with more than two steps. Emit before each step, not just at the end.
- Add cancellation via AbortController on the frontend and graceful interrupt handling on the backend. Test the non-idempotent paths explicitly.
- Move long-running tasks to a job queue once you have tasks exceeding 30 seconds. This is an infrastructure change, but the user-facing improvement is significant.
- Optimize from there — optimistic updates, skeleton states, and pause/resume are quality-of-life improvements that matter more as your agent task complexity grows.
The gap between a synchronous agent integration that users abandon and an async one they trust isn't usually in the model. It's in whether the application treats the user as a participant in the agent's work, or as a passive recipient waiting for a result.
Sources
- https://research.aimultiple.com/llm-latency-benchmark/
- https://blog.langchain.com/how-do-i-speed-up-my-agent/
- https://www.freecodecamp.org/news/how-to-use-the-optimistic-ui-pattern-with-the-useoptimistic-hook-in-react/
- https://www.aha.io/engineering/articles/streaming-ai-responses-incomplete-json
- https://vercel.com/blog/ai-sdk-4-1
- https://dev.to/haraf/server-sent-events-sse-vs-websockets-vs-long-polling-whats-best-in-2025-5ep8
- https://www.nngroup.com/articles/skeleton-screens/
- https://hackernoon.com/patterns-that-work-and-pitfalls-to-avoid-in-ai-agent-deployment
- https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/
- https://dev.to/lbd/using-bullmq-to-power-ai-workflows-with-observability-in-mind-1ieh
- https://everworker.ai/blog/connect-ai-agents-with-webhooks
- https://altar.io/next-gen-of-human-ai-collaboration/
- https://medium.com/@sureshdotariya/race-conditions-in-useeffect-with-async-modern-patterns-for-reactjs-2025-9efe12d727b0
