
The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.

The streaming response was a UX win at the wire and a stealth rewrite of every downstream contract the request path was originally engineered against. The team that shipped it did not know they had taken on infrastructure debt because nothing failed loudly. The bills, the dashboards, and the on-call rotation took the loss.

The Connection That Forgot How to End

Most L7 load balancers were tuned for the shape of traffic they have seen for a decade: short, bursty, stateless HTTP. The defaults assume a request opens a TCP connection, exchanges a few kilobytes in under a second, and the connection is recycled into a pool that another request will pick up moments later. Idle timeouts of thirty to sixty seconds are not a mistake. They are an optimization for that shape, freeing connection slots back to the pool before they accumulate.

Streaming inverts the shape. A single response now holds the connection open for the entire generation time, which for a long reasoning task or a multi-thousand-token answer is measured in tens of seconds. Worse, the connection is not even uniformly busy. The bytes flow in clumps — a token, a pause while the model runs its next forward pass, another token. From the load balancer's view this looks identical to a stalled client, and the default behavior is to terminate it.

The first symptom is a class of intermittent disconnects that the SDK retries silently. The second symptom is a connection pool that no longer recycles fast enough to absorb a traffic burst, because each open slot is now committed for the length of a model generation instead of the length of an HTTP request. The third symptom is the per-pod connection ceiling becoming the actual capacity bound — not CPU, not memory, but the file-descriptor budget the pod was provisioned with. A node that comfortably served a thousand short requests per second now serves three hundred streams because that is how many simultaneous connections it can hold without exhausting its socket table.
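The arithmetic behind that third symptom is Little's Law: average connections in flight equal arrival rate times hold time. The sketch below is illustrative only. The 0.25-second and 10-second hold times are assumptions chosen to match the rough shape described above, not measurements, and both function names are hypothetical.

```python
def concurrent_connections(requests_per_second: float, hold_seconds: float) -> float:
    """Little's Law: average connections in flight = arrival rate x hold time."""
    return requests_per_second * hold_seconds


def max_request_rate(connection_ceiling: int, hold_seconds: float) -> float:
    """Highest sustainable arrival rate before the connection ceiling is hit."""
    return connection_ceiling / hold_seconds


# Assumed: short requests hold a connection for 0.25 s.
short_in_flight = concurrent_connections(1000, 0.25)

# Assumed: a stream holds a connection for 10 s, with a ceiling of 300 sockets.
streaming_rate = max_request_rate(300, 10.0)

print(f"short requests in flight at 1000 rps: {short_in_flight:.0f}")
print(f"sustainable streaming rps at 300 connections: {streaming_rate:.0f}")
```

Under these assumptions the same connection ceiling that absorbs a thousand short requests per second admits only about thirty new streams per second, with CPU never entering the calculation.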

The remedy is not subtle but it is rarely applied before the first incident. The load balancer needs to be told that this is a streaming workload: idle timeouts extended to match the worst-case generation time, keepalive messages inserted at intervals shorter than the most aggressive proxy in the chain, concurrency-per-connection ceilings raised where the protocol allows it, and ideally a path through an HTTP/2 or SSE-aware proxy that multiplexes properly instead of holding a TCP slot per request. Without this, the team's reliability story is hostage to whichever middlebox in the chain has the shortest default.
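One way to keep an idle-looking stream alive is to emit an SSE comment line whenever the model has been quiet for longer than the shortest proxy timeout. The following is a minimal sketch using only the standard library. The `with_keepalive` name and the interval are assumptions, and it assumes each token contains no newline, since a multi-line payload would need one `data:` line per line.

```python
import asyncio
from typing import AsyncIterator

# An SSE line starting with ":" is a comment; event-stream clients ignore it.
KEEPALIVE = ": keepalive\n\n"


async def with_keepalive(
    tokens: AsyncIterator[str], interval: float
) -> AsyncIterator[str]:
    """Wrap a token stream as SSE frames, sending a comment when it goes quiet."""
    it = tokens.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            # Wait for the next token, but never longer than `interval`.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                # Model is still working: keep the connection visibly active.
                yield KEEPALIVE
                continue
            try:
                token = pending.result()
            except StopAsyncIteration:
                return
            yield f"data: {token}\n\n"
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        # Cancelling an already finished task is a no-op.
        pending.cancel()
```

The pending read is kept across timeouts rather than cancelled and restarted, so a slow forward pass is never interrupted by the heartbeat. The interval still has to be shorter than the smallest idle timeout in the chain for this to help.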

Spans That Span More Than Bytes

The tracing system was instrumented to follow a request through the service mesh: open a span when the request enters, close it when the response ends. The span's duration becomes the operation's latency, and the latency feeds the percentile histograms that drive every dashboard, alert, and SLO. This worked for the years when "the response ends" was a single point in time at the end of a roughly-known interval.

For a streaming response the question of when it ends is genuinely ambiguous. Did the operation end when the model produced the last token? When the last byte cleared the wire? When the client acknowledged it? The answer matters less than the fact that, whatever convention the instrumentation chose, the histogram buckets it was sized for no longer fit. The bucket boundaries were chosen for a service whose p50 was three hundred milliseconds and whose p99 was two seconds; the same service now has a p50 of four seconds and a p99 of fourteen. Most spans land in the overflow bucket, which means the percentile estimator's accuracy in the tail, exactly the regime the SLO cares about, degrades to a number the dashboard prints without context.
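A quick way to check whether a histogram has outgrown its buckets is to count how much of the traffic lands past the last finite boundary. The sketch below assumes cumulative-style buckets with inclusive upper bounds topping out at five seconds. The boundaries and sample durations are made up for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence


def bucket_counts(durations: Sequence[float], upper_bounds: Sequence[float]) -> List[int]:
    """Count durations per bucket; the final slot is the overflow bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for d in durations:
        # Index of the first bound >= d; past the end means overflow.
        counts[bisect_left(upper_bounds, d)] += 1
    return counts


def overflow_fraction(durations: Sequence[float], upper_bounds: Sequence[float]) -> float:
    """Share of observations the histogram cannot resolve."""
    counts = bucket_counts(durations, upper_bounds)
    return counts[-1] / len(durations)


# Hypothetical bounds (seconds) sized for sub-second traffic.
BOUNDS = [0.1, 0.5, 1.0, 2.0, 5.0]
# Hypothetical streaming span durations (seconds).
SPANS = [0.3, 4.0, 6.0, 9.0, 14.0]

print(bucket_counts(SPANS, BOUNDS))
print(overflow_fraction(SPANS, BOUNDS))
```

Any percentile that falls inside the overflow bucket can only be reported as "more than five seconds", so a large overflow fraction is a direct sign that the boundaries need to be re-derived for the new distribution.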

The structural fix is to recognize that a streaming request is not one operation, it is two. There is the request operation, which ends at first byte and whose duration is the time-to-first-token. And there is the stream operation, which begins at first byte and whose duration is the body-emission time. The first is a latency metric that maps cleanly onto user-perceived responsiveness and is the appropriate SLO for "did this thing respond." The second is a throughput-ish metric that maps onto model speed and infrastructure hold time, and is the appropriate SLO for "did this thing finish."
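The split can be sketched as a thin wrapper around the chunk iterator that records two timestamps instead of one. This is an illustration under stated assumptions, not a tracing-library API: `StreamTimings` and `timed_stream` are hypothetical names, and "first byte" is approximated as the moment the first chunk is pulled from upstream rather than when it clears the wire.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTimings:
    started: float
    first_chunk: Optional[float] = None
    finished: Optional[float] = None

    @property
    def time_to_first_token(self) -> Optional[float]:
        """Duration of the request operation: start until first chunk."""
        if self.first_chunk is None:
            return None
        return self.first_chunk - self.started

    @property
    def stream_duration(self) -> Optional[float]:
        """Duration of the stream operation: first chunk until the end."""
        if self.first_chunk is None or self.finished is None:
            return None
        return self.finished - self.first_chunk


def timed_stream(
    chunks: Iterable[str],
    timings: StreamTimings,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass chunks through unchanged while recording both timestamps."""
    try:
        for chunk in chunks:
            if timings.first_chunk is None:
                timings.first_chunk = clock()
            yield chunk
    finally:
        # Also runs if the consumer closes the stream early.
        timings.finished = clock()
```

A caller would create `StreamTimings(started=time.monotonic())` when the request arrives, wrap the model output in `timed_stream`, and then report `time_to_first_token` and `stream_duration` as two separate metrics, each with buckets sized for its own distribution.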

Conflating them produces dashboards whose percentile lines move for reasons the team cannot explain — a faster model that produces more tokens per second can make total span duration go up because users now ask longer questions, and a sequence of weeks where "latency got worse" can in fact be a sequence of weeks where the product got better. The team that does not split the span has built a measurement system that will mislead them at the next release.

The Autoscaler That Read the Signal Backwards

Every concurrency-based autoscaler operates on a model of what concurrency means. The default model assumes that a pod handling many concurrent requests is under load and a pod handling few is idle. This was a reasonable model when concurrent requests were a proxy for CPU pressure, because each request needed CPU to be served and held resources only briefly. The signal correlated with the underlying constraint.
