
The Streaming Response That Returns 200 Then Fails: How Mid-Stream Errors Break Your SLOs

Tian Pan
Software Engineer

Your availability dashboard says 99.95%. Your users say the answer stopped mid-sentence. Both are correct, and that is the problem.

The HTTP-era reliability stack was built on a single assumption: the status code is written once the server knows how the request turned out, so it summarizes the request's fate. A 200 means success. A 5xx means retry. The load balancer counts the ratio, the SLO dashboard aggregates it, and the alerting fires on the burn rate. Every layer of that stack reads the header and trusts it.

Streaming inverts the assumption. The moment your server flushes the first token, it has already committed to a 200. Everything that goes wrong after that — a provider timeout at token 400, a content filter trip mid-paragraph, a dropped TCP connection, a malformed tool-call fragment — happens after the verdict has been rendered and cannot be retracted. The request failed. The status code says it succeeded. And nothing in your reliability tooling is built to notice the difference.

The Verdict Moved From the Header to the Body

In a classic request/response cycle, the server does all its work, decides whether it worked, and then writes a status line. The status code is a summary computed with full knowledge of the outcome. This is why retry middleware, circuit breakers, and error-rate SLOs all key off it — by the time you see the code, the story is over.

Server-Sent Events and chunked streaming break this sequencing. To stream, the server must flush response headers — including the status line — before it has generated the body. HTTP has no mechanism to amend a status code once bytes have gone out. So a streaming endpoint commits to "200 OK" at roughly the same instant it commits to starting, which is the one moment it knows the least about whether the request will actually finish.
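To make the sequencing concrete, here is a minimal simulation rather than a real server. The names `streaming_handler` and `observe` are hypothetical, and the status line is modeled as the first item the handler yields.

```python
from typing import Iterator, List, Tuple


def streaming_handler(tokens: List[str], fail_at: int) -> Iterator[str]:
    """Hypothetical handler: yields the status line first, then body chunks."""
    # The status line has to go out before any body bytes exist.
    yield "HTTP/1.1 200 OK"
    for i, token in enumerate(tokens):
        if i == fail_at:
            # Upstream died mid-generation; the 200 above is already on the wire.
            raise ConnectionError("upstream provider timed out")
        yield token


def observe(handler: Iterator[str]) -> Tuple[str, List[str], bool]:
    """Return (status line, body chunks received, whether the body finished)."""
    status = next(handler)
    body: List[str] = []
    try:
        for chunk in handler:
            body.append(chunk)
    except ConnectionError:
        return status, body, False
    return status, body, True


status, body, finished = observe(
    streaming_handler(["The", " answer", " is", " 42"], fail_at=2)
)
print(status, body, finished)
```

Anything that only inspects `status` records a success here, even though `finished` is false and the body holds half a sentence.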

This is not an LLM-specific quirk; any long-lived streaming response has it. But LLM features make it acute for three reasons. Generations are long, so there is a wide window for something to fail after the first token. They depend on an upstream provider whose own reliability you do not control. And the failure modes are diverse: rate limits, context-length overflows, safety filters, and inference-server crashes can all land in the middle of a stream rather than at its start.

The result is a class of request that is, from the protocol's point of view, indistinguishable from a success — and from the user's point of view, an obvious failure. Reliability engineers have a name for the gap between those two perspectives. It is called a blind spot, and streaming creates one by construction.

How the Failure Hides From Every Layer

Walk a 200-then-fail request through a typical stack and watch each layer record the wrong thing.

The load balancer sees a request that returned 200 and closed its connection. It increments the success counter. Whether the body contained a complete answer or three tokens and a stack trace is below its resolution — it reads headers, and the header said success.

The SLO dashboard consumes the load balancer's metrics. It computes availability as the ratio of non-error responses, and this request counted as non-error. The error budget is not debited. The burn-rate alert does not arm.

The retry middleware is waiting for a status code to react to. A 503 triggers a retry; a timeout before headers triggers a retry. But a 200 with a broken body triggers nothing, because the middleware's entire contract is "retry on error status," and there is no error status. The request that most needs a retry is the one the retry layer is structurally blind to.

The user sees a paragraph that stops mid-word, or a code block with no closing fence, or a tool call that never executed. They retry by hand — re-issuing the prompt, paying for the tokens again, waiting again.

And the eval-on-traffic pipeline, if you have one, does the most insidious thing of all. It samples the truncated output, runs it through a quality judge, and the judge — correctly — scores it as low quality. That score lands in your model-quality dashboard. You now have an infrastructure failure being logged as a model regression. A week later someone opens an investigation into why answer quality dropped, and they are debugging the wrong system entirely.

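Five observers, and the four automated ones all reach the wrong conclusion for the same root cause: each of them trusts the header, directly or downstream of something that did, and the header is a fiction. The only observer who sees the truth is the user, and none of your tooling is listening to them.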

The Header Lies, So Make the Body Tell the Truth

If the status code can no longer carry the verdict, something inside the stream has to. The fix is an application-level completion protocol: the stream must end with an explicit terminal event that the client validates, and the absence of that event must be treated as a failure.

Concretely, the last thing a healthy stream emits is a sentinel: a done event, a final chunk with a known marker, or, OpenAI-style, a response.completed event. The client's contract changes from "the connection closed, therefore we are done" to "we received the terminal event, therefore we are done." Those are not the same statement. A connection can close because the generation finished, or because a proxy timed out a long-lived connection, or because the inference server died. Only the terminal event distinguishes them. If the stream ends without it, the client knows the response is incomplete even though the transport reported a clean 200.
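A minimal sketch of that client contract follows. The event shape is an assumption: parsed events are dicts with a `type` of `"delta"` (carrying `text`) or `"done"`. Real providers use their own event names, so treat these as placeholders.

```python
from typing import Iterable, List, Tuple


def consume(events: Iterable[dict]) -> Tuple[str, bool]:
    """Collect text deltas and report whether the terminal event arrived.

    Assumed (hypothetical) event shapes:
      {"type": "delta", "text": "..."}  - a piece of model output
      {"type": "done"}                  - the terminal sentinel
    """
    parts: List[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "delta":
            parts.append(event.get("text", ""))
        elif kind == "done":
            # Only this branch may declare the response complete.
            return "".join(parts), True
    # The iterator ran out (the connection closed) with no sentinel:
    # whatever text we have is a truncated response.
    return "".join(parts), False
```

The design choice is that running out of events is never treated as success. The caller gets the partial text either way, plus a flag it cannot ignore.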

The terminal event should also carry a reason, because not all stream endings are equal. There is a real difference between stop (the model finished naturally), length (it hit the token cap), and content_filter (a safety system truncated it), and all three differ from a raw transport drop, which produces no terminal event at all. One genuinely useful piece of practitioner advice: when a provider injects an error into a stream, it often arrives inside an ordinary chunk, so the HTTP status is still 200 and the error details are embedded in the streamed payload. Your parser has to inspect chunk contents, not just chunk delivery, or it will hand a serialized error object to your UI as if it were model output.
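Here is one way to sketch that inspection. It assumes each chunk is a flat JSON object with exactly one of a `text`, `finish_reason`, or `error` field. That shape is hypothetical, and real provider payloads are nested differently.

```python
import json
from typing import Tuple


def classify_chunk(raw: str) -> Tuple[str, str]:
    """Return (kind, value), where kind is "error", "finish", or "text".

    Assumes `raw` is a JSON object in a hypothetical flat format.
    """
    payload = json.loads(raw)
    # Check for an embedded error first: it was delivered successfully,
    # but it is not model output and must never reach the UI as text.
    if "error" in payload:
        err = payload["error"]
        message = err.get("message", "") if isinstance(err, dict) else str(err)
        return "error", message
    if payload.get("finish_reason"):
        return "finish", payload["finish_reason"]
    return "text", payload.get("text", "")


print(classify_chunk('{"text": "Hello"}'))
print(classify_chunk('{"error": {"message": "rate limit exceeded"}}'))
print(classify_chunk('{"finish_reason": "content_filter"}'))
```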

This terminal-event discipline is cheap to add and changes the entire downstream story, because now there is a signal that something other than the HTTP layer can key off.

A Mid-Stream Error Taxonomy, Because "Failed" Is Too Coarse

Once the client can detect an incomplete stream, it has to decide what to do — and that decision depends entirely on why the stream broke. Lumping every mid-stream failure into a single "error" bucket throws away the information you need to respond correctly.

A workable taxonomy has three branches:

  • Retryable transport faults — a dropped connection, a provider 503 surfaced mid-stream, an inference-server crash. These are safe to retry; the request was well-formed and the failure was incidental. They should debit the error budget and trigger an automatic re-request.
  • Non-retryable semantic trips — a content-filter truncation, a context-length overflow. Retrying the identical request reproduces the identical failure. These need a different path: surface a specific message, adjust parameters, or escalate — but do not silently re-roll.
  • Ambiguous truncations — the stream simply stopped with no terminal event and no error chunk. Treat these as retryable but cap the attempts and log loudly, because a high rate of ambiguous truncations usually means a proxy or CDN is buffering and timing out your long-lived connections, which is an infrastructure bug, not a model behavior.

The taxonomy matters because the retry decision is not binary. A team that retries everything will hammer the provider re-requesting content-filter trips that can never succeed. A team that retries nothing will make users manually re-issue prompts after transient blips. The branch you take has to be a function of the terminal event's reason code — which only exists because you added the completion protocol in the previous section.
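The three branches can be sketched as a small decision function. The reason strings, the retry caps, and the `Action` names are all illustrative assumptions, and `None` stands for a stream that ended with no terminal event.

```python
import logging
from enum import Enum
from typing import Optional

log = logging.getLogger("stream_retry")


class Action(Enum):
    DONE = "done"
    RETRY = "retry"
    SURFACE = "surface"   # show a specific message; do not re-roll
    GIVE_UP = "give_up"


# Hypothetical reason codes for retryable transport faults.
RETRYABLE = {"connection_dropped", "upstream_503", "server_crash"}
MAX_RETRIES = 3             # cap for known transport faults (assumed value)
MAX_AMBIGUOUS_RETRIES = 1   # tighter cap when we do not know what happened


def decide(reason: Optional[str], retries_so_far: int) -> Action:
    """Map a terminal reason to the next step."""
    if reason == "stop":
        return Action.DONE
    if reason is None:
        # Ambiguous truncation: retry sparingly and log loudly.
        log.warning("stream ended with no terminal event (retry %d)", retries_so_far)
        if retries_so_far < MAX_AMBIGUOUS_RETRIES:
            return Action.RETRY
        return Action.GIVE_UP
    if reason in RETRYABLE:
        return Action.RETRY if retries_so_far < MAX_RETRIES else Action.GIVE_UP
    # content_filter, context_length, and anything unrecognized:
    # an identical retry would fail identically, so surface it.
    return Action.SURFACE
```

Sending unrecognized reasons to `SURFACE` is a conservative default in this sketch. A team could just as reasonably route them to an alert instead.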

Redefine the SLO Around the Terminal Event

None of this reaches your dashboards until you change what the SLO measures. As long as availability is computed from HTTP status codes, the 200-then-fail request will keep counting as green no matter how good your client-side handling gets.

The fix is to define the availability SLI on the terminal event, not the status code. A request is successful when the client receives a terminal event with a healthy reason. A request is a failure when the stream ends without one, or ends with a transport-fault reason. This number is emitted by the client or an edge layer that can actually see the end of the body — not by the load balancer, which structurally cannot.
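As a rough illustration of how far the two definitions can diverge, the sketch below computes both from the same records. The record shape `(http_status, terminal_reason)` and the reason names are assumptions. So is the choice to leave other reasons, such as content-filter trips, out of the terminal-based ratio, which the definition above does not settle.

```python
from typing import List, Optional, Tuple

# (http_status, terminal_reason); reason is None when no terminal event arrived.
Record = Tuple[int, Optional[str]]

HEALTHY = {"stop"}
TRANSPORT_FAULTS = {"connection_dropped", "upstream_503", "server_crash"}


def availability_by_status(records: List[Record]) -> float:
    """The load balancer's view: any non-5xx response counts as good."""
    if not records:
        raise ValueError("no records")
    return sum(1 for status, _ in records if status < 500) / len(records)


def availability_by_terminal(records: List[Record]) -> float:
    """The client's view: good means a healthy terminal event arrived."""
    good = sum(1 for _, reason in records if reason in HEALTHY)
    bad = sum(
        1 for _, reason in records
        if reason is None or reason in TRANSPORT_FAULTS
    )
    if good + bad == 0:
        raise ValueError("no countable records")
    return good / (good + bad)


records = [(200, "stop"), (200, "stop"), (200, "stop"), (200, None)]
print(availability_by_status(records))    # 1.0
print(availability_by_terminal(records))  # 0.75
```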

This also lets you split a metric that streaming quietly turned into two. Time-to-first-token and time-to-completion are different latencies, and users weigh them differently; a single "request duration" number averages them into mush. But there is a third dimension the status code never let you see at all: completion rate, the fraction of started streams that actually reach a healthy terminal event. A streaming feature can have excellent latency and a quietly terrible completion rate, and until the SLO keys off the terminal event, that number does not exist anywhere in your observability stack.
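A sketch of recording all three numbers per stream might look like this. The event shape (`"delta"` and `"done"` types, with a `reason` on the terminal event) is hypothetical, and the clock is injectable only so the function is easy to test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamTiming:
    time_to_first_token: Optional[float]  # None if no token ever arrived
    time_to_end: float                    # until terminal event or truncation
    completed: bool                       # healthy terminal event seen


def timed_consume(
    events: Iterable[dict],
    clock: Callable[[], float] = time.monotonic,
) -> StreamTiming:
    """Measure one stream; assumes hypothetical delta/done event dicts."""
    start = clock()
    first: Optional[float] = None
    completed = False
    for event in events:
        kind = event.get("type")
        if kind == "delta" and first is None:
            first = clock() - start
        elif kind == "done":
            completed = event.get("reason") == "stop"
            break
    return StreamTiming(first, clock() - start, completed)


def completion_rate(timings: List[StreamTiming]) -> float:
    """Fraction of started streams that reached a healthy terminal event."""
    started = [t for t in timings if t.time_to_first_token is not None]
    if not started:
        raise ValueError("no started streams")
    return sum(t.completed for t in started) / len(started)
```

Here "started" is taken to mean that at least one token arrived. That is one reasonable reading, not the only one.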

For the retry path, a resumption strategy is worth designing deliberately. The naive approach re-rolls the whole generation from scratch, which doubles cost and latency and produces a different answer for a non-deterministic model. The more advanced approach — increasingly supported by inference servers and resumable-stream libraries using last-event IDs — continues from the last good token, treating the already-streamed prefix as fixed. Resumption is more work to build, but for long generations it is the difference between a reconnect costing a few hundred tokens and costing the entire response twice.
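The client half of resumption can be sketched as below. It assumes a hypothetical `connect(last_event_id)` function that opens a stream and, when given an ID, asks the server to continue after that event, as the SSE `Last-Event-ID` request header does. The server being able to replay from an ID is an assumption too, and it is the harder half to build.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (event_id, kind, data); kind is "delta" or "done" in this hypothetical format.
Event = Tuple[str, str, str]


def stream_with_resume(
    connect: Callable[[Optional[str]], Iterator[Event]],
    max_reconnects: int = 3,
) -> str:
    """Consume a stream, reconnecting from the last seen event ID on failure."""
    parts: List[str] = []
    last_id: Optional[str] = None
    for _ in range(max_reconnects + 1):
        try:
            for event_id, kind, data in connect(last_id):
                last_id = event_id
                if kind == "done":
                    return "".join(parts)
                # The prefix received so far is kept, never regenerated.
                parts.append(data)
        except ConnectionError:
            pass  # fall through and reconnect from last_id
        # A stream that ends with no terminal event is handled the same
        # way: loop around and resume after last_id.
    raise RuntimeError("stream did not complete after reconnects")
```

A real client would also want backoff between attempts, which this sketch leaves out to keep the resume logic visible.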

What This Means for How You Run AI Features

The deeper lesson is that streaming did not just change a transport detail. It moved the request's verdict from a place all your tooling was designed to read — the header — to a place none of it was designed to read — the end of the body. Every reliability instrument you inherited from the pre-streaming era is now reading a field that no longer contains the answer.

If you ship a streaming AI feature, three things need to be true. The stream must emit an explicit terminal event, and the client must treat its absence as failure. The mid-stream failure taxonomy must distinguish retryable faults from semantic trips so the retry path does not waste requests or strand users. And the availability SLO must be computed from the terminal event, so a 200-then-fail debits the error budget like the failure it is.

Until those are in place, your green dashboard is not measuring reliability. It is measuring how often the first token flushed — and then quietly assuming the rest worked out.
