Streaming AI Applications in Production: What Nobody Warns You About
The first sign something is wrong: your staging environment streams perfectly, but in production every user sees a blank screen, then the entire response appears at once. You check the LLM provider — fine. You check the backend — fine. The server is streaming tokens. They just never make it to the browser.
The culprit, 90% of the time: NGINX is buffering your response.
This is the most common streaming failure mode, and it's entirely invisible unless you know to look for it. It also captures something broader about production streaming: the problems aren't usually in the LLM integration. They're in all the infrastructure between the model and the user.
TTFT is the only metric your users actually feel
Before getting into the failure modes, it's worth being precise about what you're optimizing for. LLM inference has two latency metrics that matter for user-facing applications:
TTFT (Time to First Token): The elapsed time from when the user submits a prompt to when the first token appears in the UI. This includes request queuing, prompt prefill processing, and the network round-trip. It's the primary perceived-latency metric.
TPOT (Time Per Output Token): The average time between consecutive tokens — the "reading speed" of the stream.
Users perceive a streamed response as roughly 40% faster than a buffered response with identical total latency. The effect is almost entirely from TTFT. Seeing the first token arrive in ~300ms signals system liveness and collapses the psychological weight of waiting. After that, as long as tokens arrive at a consistent pace — even a modest one — perceived responsiveness stays high.
Consistency matters more than raw speed. A stream that delivers tokens at a steady 20 tokens/second feels better than one that bursts 100 tokens, pauses for two seconds, bursts again. When you see complaints about "the AI feeling slow" despite acceptable total latency, look at TPOT variance, not mean TPOT.
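Both metrics are cheap to instrument on either side of the wire. A minimal sketch (the token iterator and the injectable clock are stand-ins for your real stream and timing source, not any specific SDK's API):

```python
import time

def measure_stream(tokens, clock=time.monotonic):
    """Return (ttft, mean_tpot, gaps) for an iterable of streamed tokens.

    `tokens` is any iterator that blocks until the next token arrives;
    `clock` is injectable so the sketch is testable without real latency.
    """
    start = clock()
    ttft, last, gaps = None, None, []
    for _ in tokens:
        now = clock()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            gaps.append(now - last)     # inter-token gap: one TPOT sample
        last = now
    mean_tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tpot, gaps
```

High variance in `gaps` is the bursty-stream signal described above, even when `mean_tpot` looks healthy, so alert on the variance as well as the mean.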
Target thresholds by use case:
- Interactive chatbots: TTFT under 500ms
- Code completion / IDE tools: TTFT under 100ms
- Long-form generation: TTFT up to ~3 seconds is tolerable if token streaming remains smooth
- P95 alert threshold: 1-2 seconds for user-facing applications
The important tradeoff: minimizing TTFT requires smaller batch sizes at the inference layer, which reduces GPU throughput. For user-facing traffic, optimize TTFT first and accept the throughput cost.
The NGINX buffering trap (and the full proxy checklist)
NGINX's default configuration buffers proxy responses. When you add an NGINX reverse proxy in front of your streaming API endpoint, tokens accumulate server-side and flush in a burst — or never, until the response completes and the connection closes. This happens silently. There are no errors. The server logs show correct responses.
The required NGINX configuration for any SSE endpoint:
location /api/stream {
    proxy_pass http://backend;
    proxy_buffering off;
    proxy_cache off;
    gzip off;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    proxy_read_timeout 86400s;
    proxy_send_timeout 86400s;
}
The proxy_buffering off directive is the critical one. gzip off matters too — compression operates on complete chunks and re-buffers the stream. The proxy_read_timeout increase is necessary because the default 60-second timeout will kill any response generating more than a minute of tokens. A Claude Opus response at 4,096 output tokens takes over two minutes at 30 tokens/second.
Your backend should also set the X-Accel-Buffering: no response header. It signals NGINX (and compatible proxies) to disable buffering for that response, regardless of how the proxy itself is configured. Consider it defense in depth.
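On the application side, the header set an SSE endpoint should send can be sketched as a minimal WSGI app (the route shape and token loop are illustrative, not a specific framework's API):

```python
def sse_app(environ, start_response):
    """Minimal WSGI sketch: the response headers every SSE endpoint should send."""
    headers = [
        ("Content-Type", "text/event-stream"),
        ("Cache-Control", "no-cache"),       # never cache a live stream
        ("X-Accel-Buffering", "no"),         # tells NGINX to skip proxy buffering
    ]
    start_response("200 OK", headers)
    # Stand-in for the real token stream: yield one SSE frame per token.
    return (f"data: token{i}\n\n".encode() for i in range(3))
```

Most frameworks (FastAPI, Flask, Express) expose an equivalent way to set per-response headers on their streaming response type; the header names are what matter.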
CDNs compound the problem. Cloudflare, AWS CloudFront, and most other CDN providers buffer SSE by default. The simplest fix is to route streaming endpoints around the CDN entirely. If you need CDN coverage for other reasons, check vendor documentation for streaming passthrough configuration — it exists, but it's not the default.
Verify your configuration is actually taking effect. Run nginx -T to dump the merged effective config, not just the file you edited. Multiple config files, include directives, and location block inheritance can silently override your settings.
Heartbeats: the failure mode that takes 90 seconds to manifest
SSE connections with no data flowing will be closed by intermediate proxies, some CDNs, and certain browsers after 30-120 seconds of silence. This shows up as timeouts that are oddly correlated with thinking time — slow responses on hard prompts get cut off, fast responses on simple prompts work fine.
The fix is a heartbeat: send an SSE comment event every 15-30 seconds during any pause:
: heartbeat

SSE comment lines start with : and are ignored by clients. The blank line terminates the event. This keeps the connection alive through tool execution waits, long prefill processing, and any other pause where no tokens are flowing. It's two lines of code and prevents a category of timeout failures that can take weeks to diagnose in production.
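A sketch of the server side, assuming an asyncio backend: a background task pumps tokens into a queue, and the SSE loop emits a heartbeat whenever the queue stays empty past the interval. (Timing out `queue.get()` is safe; timing out the generator's `__anext__` directly would cancel and kill the generator.)

```python
import asyncio

HEARTBEAT_INTERVAL = 15  # seconds; within the 15-30s window recommended above

async def sse_events(token_source, heartbeat=HEARTBEAT_INTERVAL):
    """Yield SSE frames, inserting comment heartbeats whenever the model pauses."""
    queue: asyncio.Queue = asyncio.Queue()
    DONE = object()

    async def pump():
        # Drain the token source independently of client-facing pacing.
        async for token in token_source:
            await queue.put(token)
        await queue.put(DONE)

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                item = await asyncio.wait_for(queue.get(), timeout=heartbeat)
            except asyncio.TimeoutError:
                yield ": heartbeat\n\n"   # SSE comment frame; clients ignore it
                continue
            if item is DONE:
                break
            yield f"data: {item}\n\n"
    finally:
        task.cancel()  # don't leak the pump task if the client disconnects
```

The same queue-based shape also covers tool-execution waits: as long as the pump task is alive but silent, the client keeps receiving heartbeats.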
The O(n²) JSON parsing problem you'll hit at scale
If your application streams structured outputs — function call arguments, JSON responses, typed data — you'll face an incremental parsing problem. Every intermediate chunk is invalid JSON. The natural impulse is to feed each new chunk to a JSON repair library and try to construct the partial object for real-time display.
This works in development with small responses. In production with larger responses, it causes visible stuttering and eventual timeouts. The reason: naive incremental parsing is O(n²). Each call to the repair library reparses the entire accumulated string from the beginning. For a 12 KB response delivered in small chunks, this means processing roughly 15 million total characters instead of 12,000. Real benchmarks show the degradation curve:
- Chunks 1-688: under 1ms per chunk
- Chunk 689: 3.2ms — first noticeable lag
- Chunk 954: 5.4ms — visible stuttering
- Chunk 1514: 16.2ms per chunk at 63% completion
The correct approach is stateful incremental parsing: maintain parser state between calls (last parsed index, current nesting context, incomplete token buffer). This cuts total processing for a 12 KB response from tens of seconds to ~43ms — a speedup of several hundred times. If you're building this yourself, the core insight is that each new chunk only requires processing the characters that are actually new, not the full accumulated string.
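The core of that insight fits in a few dozen lines. This toy sketch tracks only string/escape state and the bracket stack between chunks; a production parser would also need to handle incomplete literals (`tru`), dangling keys, and partial numbers:

```python
import json

class IncrementalJSONState:
    """Stateful partial-JSON tracker: each chunk is scanned exactly once."""

    def __init__(self):
        self.buf = []           # accumulated chunks (joined lazily)
        self.stack = []         # currently open '{' / '[' containers
        self.in_string = False
        self.escaped = False

    def feed(self, chunk: str) -> None:
        # O(len(chunk)) per call -- never rescans previously seen characters.
        self.buf.append(chunk)
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.stack.append(ch)
            elif ch in "}]" and self.stack:
                self.stack.pop()

    def snapshot(self):
        """Best-effort completed object for display, or None mid-token."""
        text = "".join(self.buf)
        if self.in_string:
            text += '"'                      # close the dangling string
        for opener in reversed(self.stack):
            text += "}" if opener == "{" else "]"
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            return None  # chunk boundary fell mid-token; wait for more data
```

Calling `snapshot()` per chunk is still O(n) for the join and parse, so real implementations display throttled (e.g., per animation frame) rather than per chunk; the point here is that `feed()` itself stays O(chunk).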
For cases where valid structure is a hard requirement (not just best-effort display), constrained decoding is the alternative: tools like Outlines and vLLM's structured output feature mask invalid tokens during generation so the model can only produce output conforming to a specified schema. This eliminates the partial-parsing problem entirely but requires inference-side infrastructure.
Resumable streams: the pattern that changes what's possible
The most consequential production streaming pattern is one most teams implement too late: separating the LLM generation process from the client connection.
Client connections are fragile. Laptops close. Networks drop. Users refresh mid-response. In the naive architecture — client sends request, server streams response — any connection interruption loses the partial response and forces re-generation. At scale, this wastes significant API spend and frustrates users.
The decoupled architecture:
- Client sends a generation request with a stable session ID (UUID, stored in localStorage)
- A stream generator service starts LLM inference and writes each chunk to a Redis Stream with XADD
- A separate stream consumer API reads from Redis and forwards chunks to the client over SSE
- If the client disconnects, generation continues uninterrupted
- On reconnect, the client sends its session ID; the consumer reads from the client's last-seen position using Redis consumer groups
This unlocks several properties:
- Refresh-safe: page refresh reconnects to the same stream, resuming from the last received token
- Multi-device: open the same session on a second device and pick up where you left off
- Cost-efficient: no re-generation on reconnect; you pay for each generation exactly once
- Idempotent: the session ID deduplicates generation triggers — sending the same ID twice returns the existing stream
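The pattern's moving parts can be sketched without a live Redis instance. This in-memory stand-in mirrors the shape of the real thing: in production, `append()` would be an `XADD` and `read_from()` an `XREAD`/consumer-group read keyed by entry ID (class and function names here are illustrative, not from any library):

```python
class ResumableStream:
    """In-memory stand-in for a Redis Stream holding one session's chunks."""

    def __init__(self):
        self.entries = []          # (entry_id, chunk), append-only like XADD

    def append(self, chunk: str) -> int:
        entry_id = len(self.entries)   # Redis would assign a timestamp-seq ID
        self.entries.append((entry_id, chunk))
        return entry_id

    def read_from(self, last_seen: int):
        # A reconnecting client sends its last-seen ID and resumes from there.
        return self.entries[last_seen + 1:]

sessions: dict[str, ResumableStream] = {}

def get_or_create(session_id: str) -> tuple[ResumableStream, bool]:
    # Idempotent trigger: the same session ID returns the existing stream
    # instead of starting a second generation.
    created = session_id not in sessions
    if created:
        sessions[session_id] = ResumableStream()
    return sessions[session_id], created
```

The essential property is that the writer (generation) and reader (client connection) share nothing but the stream and a cursor, which is exactly what makes refresh, multi-device, and reconnect free.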
Pair this with AbortController on the client for user-initiated cancellations. Wire the abort signal to both the fetch request and a backend cancellation API. This stops token generation at the inference layer, not just the connection — important for cost at scale since you stop paying for tokens that nobody will see.
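The backend half of that wiring can be sketched with asyncio task cancellation (the in-memory registry and function names are hypothetical; a real deployment keys this in shared state so any backend instance can cancel):

```python
import asyncio

# session_id -> in-flight generation task (hypothetical in-memory registry)
active_generations: dict[str, asyncio.Task] = {}

async def generate(session_id: str) -> list[str]:
    """Stand-in generation loop; each sleep represents awaiting one LLM token."""
    tokens: list[str] = []
    try:
        for i in range(1000):
            await asyncio.sleep(0.01)
            tokens.append(f"tok{i}")
    except asyncio.CancelledError:
        pass   # stop at the inference layer, not just the connection
    finally:
        active_generations.pop(session_id, None)
    return tokens

async def cancel_generation(session_id: str) -> bool:
    """The handler your abort endpoint calls when the client's signal fires."""
    task = active_generations.get(session_id)
    if task is None:
        return False
    task.cancel()
    return True
```

With a hosted API the cancellation would instead close the provider stream (most SDKs stop billing unread tokens when the connection drops); the registry-plus-cancel shape is the same.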
Load balancing long-lived connections
Standard HTTP load balancing assumes short-lived requests. LLM streaming connections are long-lived — minutes for complex agentic tasks. This breaks two common defaults.
Round-robin load balancing distributes requests evenly at connection time. After an hour of production traffic, one backend might hold 800 active streams while another holds 100, depending on request timing. Use least_conn instead: route each new connection to the backend with the fewest active connections.
ip_hash sticky sessions break behind CDNs. All CDN traffic arrives from a small set of CDN edge IPs, so ip_hash routes everything to one or two backends. Use cookie-based affinity instead.
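Both fixes live in the upstream block. A sketch (backend addresses are placeholders; note that the `sticky cookie` directive is NGINX Plus only, so open-source NGINX approximates cookie affinity with a consistent hash on an application-set session cookie):

```nginx
upstream llm_backends {
    least_conn;                  # route new streams to the least-loaded backend
    server 10.0.0.11:8000;       # placeholder backend addresses
    server 10.0.0.12:8000;

    # Open-source alternative to NGINX Plus "sticky cookie": consistent hash
    # on a session cookie your application sets. Uncomment instead of least_conn
    # if you need affinity rather than load spreading:
    # hash $cookie_session_id consistent;
}
```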
Connection draining on deploy: when rolling out a new version, backends will terminate active streams mid-response. Configure graceful termination periods at least as long as your maximum expected stream duration. On AWS ALB, this is deregistration_delay. On GCP, configure backend service timeout accordingly.
On GCP specifically: the HTTP(S) load balancer has a 30-second default backend timeout and buffers streaming responses. Use TCP load balancing with client IP affinity, or externalize all per-request state to Redis to make backends stateless and avoid sticky session requirements entirely.
Production memory and connection management
Several widely-used LLM proxy libraries have documented memory leak issues under sustained load:
- LiteLLM: async streaming handlers accumulate memory with each call, causing OOM crashes after several hours under load. At 500 RPS sustained, reported timeout rates reach 8%, with process restarts needed every 6-8 hours.
- vLLM: slow linear memory growth (~400 MB/minute) under production traffic, traced to reference cycles in async generator cleanup.
- LangChain: memory growth from conversation history accumulation and uncleaned vector embeddings; containers OOM after hundreds of LLM calls in long-running processes.
The mitigations: always wrap stream consumers in try/finally to guarantee cleanup. Use context managers for generator objects. Monitor memory over time in staging under production-like traffic patterns before shipping. In Python, GC pauses, lock contention, and asyncio event loop starvation all worsen nonlinearly between 10 concurrent connections (development) and 1,000 (production). Test at realistic concurrency.
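The try/finally pattern is short enough to show in full. This sketch assumes the stream is an async generator exposing `aclose()`; without that call, a client disconnect mid-iteration can leave the generator suspended, which is one source of the reference-cycle leaks above:

```python
async def consume_stream(stream):
    """Drain an async token stream, guaranteeing generator cleanup."""
    chunks = []
    try:
        async for chunk in stream:
            chunks.append(chunk)
    finally:
        # Runs even if the client disconnects or an exception interrupts the
        # loop; releases any connections/buffers the generator holds open.
        await stream.aclose()
    return chunks
```

`aclose()` is a no-op on an already-exhausted generator, so it's safe to call unconditionally in the finally block.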
The infrastructure checklist
For any streaming endpoint going to production:
- NGINX: proxy_buffering off, gzip off, proxy_read_timeout 86400s, proxy_http_version 1.1, and the X-Accel-Buffering: no response header
- Load balancer: least_conn strategy, cookie-based sticky sessions, graceful connection draining on deploy
- CDN: route streaming endpoints around the CDN or verify streaming passthrough configuration
- Heartbeats: SSE comment event every 15-30 seconds during any generation pause
- Timeouts: review and increase every timeout in the request path — load balancer, proxy, backend service — to exceed your maximum expected response duration
- Memory: load test at production concurrency, not development concurrency; set up memory growth alerts
- Abort handling: wire AbortController to both the client fetch and backend cancellation; measure cost savings from early termination
- Resumability: consider the Redis-backed decoupled architecture for any application where partial responses have real cost or user impact
The streaming layer is rarely where interesting AI work happens, but it's consistently where production reliability is lost. The patterns above don't require deep infrastructure expertise to implement — they just require knowing they exist before you're debugging them at 2am.
