The Streaming Infrastructure Behind Real-Time Agent UIs
Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.
This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.
SSE vs. WebSocket: Why Every Major LLM API Chose SSE
When OpenAI, Anthropic, and Google shipped their streaming APIs, they all made the same choice: Server-Sent Events over HTTP, not WebSockets. This isn't inertia; it reflects a specific set of trade-offs that favor SSE for the core token delivery use case.
SSE runs over plain HTTP/HTTPS with no protocol upgrade handshake. That means it works transparently with every reverse proxy, CDN, and load balancer that already understands HTTP. Horizontal scaling is stateless — no sticky sessions, no socket brokers. The browser EventSource API handles reconnection automatically using Last-Event-ID, for free. And with HTTP/2, the old objection that "SSE is limited to 6 concurrent connections per domain" is simply gone — Chrome allows 100 concurrent SSE streams over a single H2 connection.
WebSockets give you one thing SSE doesn't: true bidirectionality over a single TCP connection. The client can send data to the server while the server is actively streaming back. For most LLM token delivery, you don't need this — the user submitted a request, the model responds, done. Where bidirectionality actually matters:
- Human-in-the-loop tool approval while streaming is in progress
- Real-time collaborative AI editors where the user can steer mid-generation
- Voice agents (which typically use WebRTC for the audio channel, not WebSockets)
One team replaced SSE with WebSockets in a token-streaming prototype and rolled it back within three days. Load balancers required special configuration, reconnect logic became non-trivial, and observability tooling that understood HTTP no longer applied. The lesson: WebSocket complexity compounds at the infrastructure layer. Only pay that cost when you genuinely need bidirectionality.
A reasonable default: use SSE for token delivery and agent progress events, and add a separate HTTP POST endpoint for signals (cancellation, approval) that the client sends back to the server. This keeps the streaming path simple while handling the cases that require client-to-server communication.
The Proxy Buffering Problem Nobody Warns You About
The most common production streaming failure isn't in your application code — it's at the infrastructure layer. Nginx, Cloudflare, and AWS ALB buffer HTTP responses by default. With buffering on, your streaming endpoint works perfectly in local testing and then silently delivers all tokens as a single batch dump at the end in production.
For nginx, the fix is explicit. Disable buffering for the streaming location:

```nginx
proxy_buffering off;
```

Alternatively, leave global buffering on and have your application send the X-Accel-Buffering: no response header; nginx disables buffering for any response whose upstream sets it. (Setting the header with add_header in nginx itself only forwards it to the client and does nothing for nginx's own buffering.)
Cloudflare has a specific bug where GET requests through Cloudflare Tunnels (cloudflared) are buffered until the connection closes. The workaround is to use POST requests for your SSE endpoints instead of GET when routing through Cloudflare Tunnels. For Cloudflare proxy (not tunnels), send a heartbeat comment every 30 seconds to prevent the 100-second unresponsive connection timeout from killing the stream:
```
: heartbeat\n\n
```
Always flush headers immediately before writing the first data chunk. In Node.js: res.flushHeaders(). In FastAPI: return a StreamingResponse with an async generator — the framework handles flushing per yield. Without explicit flushing, the response headers and first chunk may get batched together by the TCP stack, introducing noticeable lag before the first token appears in the UI.
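Putting the flush and the heartbeat together, one way to start an SSE response in raw Node.js looks like this. A sketch: the helper name and the 30-second default are mine, not from any library:

```javascript
// Sketch: flush SSE headers immediately, then keep the connection warm
// with a comment-line heartbeat so proxies don't time the stream out.
const HEARTBEAT = ': heartbeat\n\n'; // SSE comment line, ignored by EventSource

function startSse(req, res, intervalMs = 30_000) {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no'); // tell nginx not to buffer this response
  res.flushHeaders(); // headers go out now, before the first token

  const timer = setInterval(() => res.write(HEARTBEAT), intervalMs);
  req.on('close', () => clearInterval(timer));
  return (chunk) => res.write(`data: ${chunk}\n\n`);
}
```

The returned writer is then called once per token; the heartbeat rides alongside real data, and EventSource clients silently drop the comment lines.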
Backpressure: When the Model Outruns the Client
At 80-100 tokens per second on a fast inference endpoint, the model can outpace what the client's network can absorb. Naive implementations either buffer everything in memory (crash eventually) or drop tokens silently. Neither is acceptable.
In Node.js, res.write() returns false when the kernel write buffer is full. The correct pattern:
```javascript
const canContinue = res.write(data);
if (!canContinue) {
  await new Promise((resolve) => res.once('drain', resolve));
}
```
Waiting on the drain event suspends the token write loop until the kernel buffer clears. This applies cooperative backpressure up through your application layer to the inference request itself, naturally throttling generation to match what the client can consume.
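Wrapped into a helper, the drain wait composes with the token loop so that a slow client throttles generation (a sketch; the function names are illustrative):

```javascript
// Sketch: write a chunk and suspend the loop until the kernel
// buffer drains, propagating backpressure up to the generation loop.
async function writeWithBackpressure(res, chunk) {
  if (!res.write(chunk)) {
    await new Promise((resolve) => res.once('drain', resolve));
  }
}

// The token loop awaits each write, so a slow client naturally
// throttles how fast tokens are pulled from the model.
async function relayTokens(tokenIterable, res) {
  for await (const token of tokenIterable) {
    await writeWithBackpressure(res, `data: ${token}\n\n`);
  }
  res.end();
}
```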
For WebSocket connections, monitor ws.bufferedAmount. When it exceeds your threshold (64KB is a common starting point), pause the upstream generator until it drains.
For Python async servers (FastAPI, Starlette), async generators implement implicit backpressure — the generator only advances when the framework's event loop calls __anext__(), which it does only when the previous chunk was sent. The infrastructure handles it; you just need to write a proper async generator rather than collecting all chunks and yielding at the end.
The failure mode to watch for in high-throughput deployments: if your inference server (vLLM, TGI) serves multiple concurrent streams and some clients apply backpressure, the inference server may stall GPU memory for those slow clients while faster clients wait. Monitor per-connection queue depth and implement connection-level timeouts for clients that fall too far behind.
Graceful Cancellation: What Happens When Users Click Stop
When a user abandons a long-running agent task, three things need to happen: the LLM generation stops, the server-side cleanup runs, and orphaned sub-processes don't keep burning tokens.
On the client side, AbortController signals cancellation:
```javascript
const controller = new AbortController();

const response = await fetch('/api/agent', {
  method: 'POST',
  signal: controller.signal,
});

// User clicks stop:
controller.abort();
```
Aborting the fetch tears down the underlying request, and the server observes the disconnect. In raw Node.js, listen for req.on('close', ...). Frameworks that hand you a Fetch-style Request object (Next.js route handlers, Hono) expose the same thing as req.signal. Forward it to the LLM provider call:
```javascript
const result = await streamText({
  model,
  prompt,
  abortSignal: req.signal,
});
```
Most LLM provider SDKs accept an abortSignal parameter and will cancel the in-flight request, releasing any reserved inference capacity.
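Raw Node.js request objects don't carry an AbortSignal, so a common pattern is to derive one from the response's close event, using writableFinished to distinguish a client disconnect from a normally completed response (a sketch; the helper name is illustrative):

```javascript
// Sketch: derive an AbortSignal from a raw Node.js response so it can
// be forwarded to an LLM SDK call that accepts `abortSignal`.
function clientAbortSignal(res) {
  const controller = new AbortController();
  // 'close' fires on both normal completion and client disconnect;
  // writableFinished is true only if we finished writing the response.
  res.on('close', () => {
    if (!res.writableFinished) controller.abort();
  });
  return controller.signal;
}
```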
Server-side cleanup is where teams consistently miss steps. When a stream aborts:
- Release KV cache blocks at the inference layer. If you're self-hosting (vLLM), not canceling the generation request leaves KV cache blocks reserved for the abandoned session, reducing capacity for other users.
- Persist partial results if resumption is expected. If users can restart tasks, save whatever tokens had been delivered before the abort.
- Release database connections and locks. Agent tasks often hold open transactions or advisory locks for the duration of the run.
For agents orchestrated with Temporal, cancellation must be explicit. Place every LLM call inside a Temporal Activity, not Workflow code. Set a HeartbeatTimeout calibrated to your expected inference latency plus replay headroom. When the parent workflow is cancelled, the next heartbeat propagates that signal to the in-flight activity. Without this, child workflows become orphaned — they continue running, accumulating cost and event history, until they either complete or hit Temporal's 51,200-event workflow limit.
Reconnection: When the Browser Refreshes Mid-Task
This is the hardest streaming infrastructure problem, and most agent frameworks provide no answer to it. A user starts a 10-minute research task, their browser refreshes (tab crash, accidental F5, mobile background-kill), and the task vanishes. From the agent's perspective, the stream had a client disconnect — it may or may not continue running depending on your server architecture, but the client has no idea where to reconnect.
The production solution uses three tiers:
Tier 1 (KV store): Maps task_id → active_stream_id with lifecycle state: pending → ongoing → complete. This is the reconnection lookup index.
Tier 2 (stream buffer): Stores all agent event chunks keyed by task_id + stream_id, independent of client connection status. Redis Streams works well here — ordered, persistent, subscribable.
Tier 3 (SSE relay): When a client reconnects (via GET to /api/task/{id}/stream), the relay looks up active_stream_id from Tier 1, subscribes to Tier 2 for that stream, and relays buffered chunks via SSE from the beginning. If the task already completed, return HTTP 204 and let the client render the stored result.
The critical design constraint: the buffer in Tier 2 must be written by the server-side agent process, not by the SSE endpoint. The agent emits events to the buffer regardless of whether anyone is listening. The SSE endpoint is purely a relay — clients connect, replay, and track their position via Last-Event-ID.
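The Tier 2 contract can be sketched with an in-memory buffer standing in for Redis Streams (the class and method names are illustrative):

```javascript
// Sketch of the Tier 2/3 split, with an in-memory buffer standing in
// for Redis Streams. The agent appends; the SSE relay only replays.
class StreamBuffer {
  constructor() {
    this.chunks = []; // [{ id, data }], ordered by arrival
  }
  // Called by the agent process, whether or not a client is connected.
  append(data) {
    const id = this.chunks.length + 1;
    this.chunks.push({ id, data });
    return id;
  }
  // Replay everything after lastEventId (0 = from the beginning),
  // matching SSE's Last-Event-ID resumption contract.
  replayAfter(lastEventId = 0) {
    return this.chunks.filter((c) => c.id > lastEventId);
  }
}
```

On reconnect, the relay reads the client's Last-Event-ID header, writes replayAfter(lastEventId), then subscribes for live chunks; with real Redis Streams, XRANGE gives the same semantics with persistence.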
Several race conditions appear in production:
- The stream buffer may have chunks that arrived after the task marked itself complete. Always drain the buffer fully before rendering completion.
- If your chat framework assigns message IDs only when generation finishes, you can't use them as stream IDs during generation. Use a separate UUID generated at task start.
- On page re-mount, check active_stream_id before attempting reconnection. If it's null, the task either hasn't started or already completed — don't issue a reconnection request that will fail.
The native SSE Last-Event-ID reconnection mechanism (where the browser automatically reconnects and the server replays from that ID) handles short-lived streams well but is insufficient for multi-minute agent tasks. Full page navigation clears EventSource state, breaking the reconnection chain. Use the three-tier architecture for tasks that may run longer than a typical browser session.
The Layered Event Model for Multi-Step Agent Output
For single-turn token streaming, a flat stream of text deltas is sufficient. For multi-step agents — where the UI needs to show tool calls in progress, intermediate reasoning, parallel sub-tasks, and final synthesis — you need a structured event hierarchy.
The emerging standard, formalized in the AG-UI protocol and implemented independently by the OpenAI Agents SDK and LangGraph, organizes events into three layers:
Layer 1 (raw): Token deltas, raw LLM output. Used by chat UIs that just display text.
Layer 2 (semantic): tool_called, tool_result, message_complete, agent handoff events. Used by agent dashboards that need to show what the agent did.
Layer 3 (lifecycle): run_started, run_finished, run_error, checkpoint_saved. Used by monitoring systems and orchestration coordinators.
UIs subscribe to the layer they need. Adding Layer 2+ visibility is what turns "a spinner and then a response" into a UI that shows the agent searching three databases, reading five files, and synthesizing results — which meaningfully changes whether users trust the output.
One failure mode specific to streaming tool calls: Claude (and other providers) stream tool call inputs incrementally as partial JSON. The accumulation contract is strict — input_json = "", append each delta, parse only on content_block_stop. If your max_tokens limit is hit mid-tool-call, the accumulated JSON is incomplete and JSON.parse() will throw. Always check stop_reason before parsing tool inputs. If it's "max_tokens" with an in-progress tool call, the tool call is malformed and the session needs a recovery path, not a parse attempt.
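The accumulation contract can be sketched as a small accumulator. The event shapes follow Anthropic's streaming format, but the helper itself and its recovery branch are illustrative:

```javascript
// Sketch: accumulate input_json_delta chunks and parse only when the
// block stops AND stop_reason is safe. Handling of other event types
// is omitted.
function createToolInputAccumulator() {
  let inputJson = '';
  return {
    onDelta(partialJson) {
      inputJson += partialJson; // strict append, no parsing yet
    },
    onStop(stopReason) {
      if (stopReason === 'max_tokens') {
        // Truncated mid-tool-call: the JSON is incomplete, don't parse.
        return { ok: false, reason: 'truncated_tool_call' };
      }
      return { ok: true, input: JSON.parse(inputJson || '{}') };
    },
  };
}
```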
The Failure Recovery Architecture
Two failure modes in multi-step agent streams kill sessions silently:
Truncated tool calls: A network glitch truncates the streaming response so the final message has stop_reason: "tool_use" but zero tool use content blocks. The agent treats this as a valid tool-use turn, finds nothing to execute, and either loops or goes idle. Prevention: validate that every message with stop_reason: "tool_use" contains at least one complete tool use block before proceeding.
Orphaned tool results: A crash between emitting a tool call and receiving the tool result leaves the conversation history with a tool_result that has no matching tool_use. Most LLM APIs reject this state as a hard error — the session becomes unrecoverable without manual history editing. Prevention: persist the assistant message and the tool results atomically. Write both to your store in a single transaction before the agent loop proceeds.
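The truncated-tool-call invariant is cheap to enforce before the agent loop proceeds (a sketch, assuming Anthropic-style message shapes):

```javascript
// Sketch: reject messages that claim tool use but carry no complete
// tool_use block (Anthropic-style message shape assumed).
function isValidToolUseTurn(message) {
  if (message.stop_reason !== 'tool_use') return true; // nothing to check
  return (message.content || []).some(
    (block) =>
      block.type === 'tool_use' &&
      block.id &&
      block.name &&
      block.input !== undefined
  );
}
```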
For agents running on durable execution infrastructure (Temporal, LangGraph's PostgreSQL checkpointer), the persistence and replay semantics handle most of this automatically. For agents built on raw HTTP with custom state management, these are manual invariants you have to enforce.
The recovery pattern for transient tool failures: return the error as a tool_result with is_error: true. The LLM receives it as content, can decide to retry with different parameters or escalate. This is better than raising an exception that aborts the session, because the LLM has context the exception handler doesn't.
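A sketch of that recovery pattern, wrapping tool execution so failures become content instead of exceptions. The wrapper itself is illustrative; the tool_result shape follows Anthropic's API:

```javascript
// Sketch: surface tool failures to the model as content rather than
// aborting the session; the model can then retry or escalate.
async function runTool(tool, toolUse) {
  try {
    const output = await tool(toolUse.input);
    return {
      type: 'tool_result',
      tool_use_id: toolUse.id,
      content: JSON.stringify(output),
    };
  } catch (err) {
    return {
      type: 'tool_result',
      tool_use_id: toolUse.id,
      content: `Tool failed: ${err.message}`,
      is_error: true, // the LLM sees the failure and decides what to do
    };
  }
}
```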
What to Instrument
The standard monitoring set for streaming production systems:
- TTFT (Time to First Token): Perceived responsiveness. Most provider outages manifest before the first token returns, not mid-stream. This is the window where fallback logic is viable.
- Stream completion rate: Track separately by user abort vs. server error vs. normal completion. A rising abort rate often means latency problems, not user behavior changes.
- Connection duration P99: Set infrastructure timeouts above this, not at it. The common mistake is setting nginx's proxy_read_timeout to 60 seconds when your P99 task duration is 55 seconds.
- Chunk gap time: Time between successive chunks. Spikes indicate inference stalls, not network issues.
- Reconnection success rate: If you've built resumable streams, measure whether reconnections actually replay correctly. Silent failures here mean users lose tasks without knowing why.
The debugging pattern that surfaces the most issues: log every stream lifecycle event (start, first chunk, last chunk, close, error, abort) with a common stream_id and correlate them. Streams that start but never close indicate connection leaks. Streams that close immediately after start indicate proxy buffering issues. Large gaps between the last chunk and close indicate your post-processing (persistence, compaction) is blocking the stream close.
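A minimal version of that correlation, here flagging streams that started but never terminated (a sketch; the class name and storage are illustrative):

```javascript
// Sketch: correlate lifecycle events per stream_id and flag streams
// that started but never closed (likely connection leaks).
class StreamLifecycleLog {
  constructor() {
    this.events = new Map(); // stream_id -> [{ event, at }]
  }
  record(streamId, event) {
    if (!this.events.has(streamId)) this.events.set(streamId, []);
    this.events.get(streamId).push({ event, at: Date.now() });
  }
  leakedStreams() {
    const leaked = [];
    for (const [id, evts] of this.events) {
      const names = evts.map((e) => e.event);
      const terminated = names.some((n) => ['close', 'error', 'abort'].includes(n));
      if (names.includes('start') && !terminated) leaked.push(id);
    }
    return leaked;
  }
}
```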
Real-time agent UIs are fundamentally different from API-over-HTTP applications. The infrastructure requirements — persistent connections, ordered delivery, resumability, backpressure, graceful cancellation — don't have defaults that work. Each one requires an explicit design decision, and the failure modes when they're missing tend to be silent rather than loud.
Sources

- https://compute.hivenet.com/post/llm-streaming-sse-websockets
- https://procedure.tech/blogs/the-streaming-backbone-of-llms-why-server-sent-events-(sse)-still-wins-in-2025
- https://docs.ag-ui.com/introduction
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/fine-grained-tool-streaming
- https://platform.claude.com/docs/en/build-with-claude/streaming
- https://openai.github.io/openai-agents-python/streaming/
- https://ai-sdk.dev/docs/advanced/stopping-streams
- https://ai-sdk.dev/docs/ai-sdk-ui/chatbot-resume-streams
- https://stardrift.ai/blog/streaming-resumptions
- https://upstash.com/blog/realtime-ai-sdk
- https://www.xgrid.co/resources/temporal-ai-agent-orchestration-failure-patterns/
- https://www.assembled.com/blog/your-llm-provider-will-go-down-but-you-dont-have-to
- https://modelcontextprotocol.io/specification/2025-06-18/basic/transports
- https://vercel.com/changelog/node-js-vercel-functions-now-support-request-cancellation
- https://community.cloudflare.com/t/using-server-sent-events-sse-with-cloudflare-proxy/656279
- https://docs.vllm.ai/en/stable/design/metrics/
