The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice
Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.
The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.
This is the failure mode that the streaming guides never quite warn you about, because the streaming guides are written by people standing at the two ends — the provider and the browser — and the proxy chain in between is somebody else's problem until the day it becomes yours. The moment you put a real production network between your LLM call and your user, you inherit a contract you did not negotiate: every box on the path has its own opinion about how long a TCP connection can stay quiet, and those opinions are silently and unevenly enforced.
The silence between tokens is normal, and that is the problem
LLM streaming feels like a torrent of tokens, but at the byte level it is bursty. The model emits a token, then thinks, then emits another. Most of the time the gap is small. But the moment the assistant decides to call a tool — and the agent loop dutifully goes off to hit a downstream API, query a database, run a sandboxed script — the SSE stream goes idle. Not closed. Just idle. The provider is still holding the connection open, waiting to send the next event. Your client is still waiting to receive it. Nothing is wrong.
To every intermediate proxy on the network path, "nothing is wrong" looks identical to "connection has been abandoned." A reverse proxy cannot read the semantics of an SSE stream; it can only count the seconds since the last byte. If the gap exceeds its idle timeout, it does what it was configured to do: it closes the connection. From the provider's side, the socket suddenly RSTs. From your side, the EventSource fires onerror and, if you are using a standard reconnect library, immediately reopens the connection with a fresh request — which on a stateless LLM API means re-sending the entire prompt and being billed for the entire response a second time.
The defaults that bite you are everywhere and they are short. AWS ALB defaults to a 60-second idle timeout. Nginx's proxy_read_timeout defaults to 60 seconds. Cloudflare's free and pro plans enforce a 100-second cap that you cannot raise from a config file. AWS Service Connect has been observed cutting SSE streams at around 15 seconds because it does not treat SSE the way it treats WebSockets. The user-facing symptom is always the same: a stream that worked locally, worked in staging behind no proxy, and then mysteriously dies in production after exactly N seconds where N is the smallest idle timeout on the path.
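The "smallest idle timeout on the path" rule is simple enough to write down. The sketch below uses the timeout figures quoted above. The hop labels, the function name, and the 0.5 safety factor are assumptions for illustration, not anything a vendor prescribes.

```python
# Idle timeouts (seconds) quoted in the text; the hop labels are illustrative.
PATH_TIMEOUTS_S = {
    "AWS ALB": 60.0,
    "Nginx proxy_read_timeout": 60.0,
    "Cloudflare (free/pro)": 100.0,
    "AWS Service Connect (observed)": 15.0,
}


def tightest_hop(timeouts_s: dict[str, float]) -> tuple[str, float]:
    """Return the hop with the shortest idle timeout; that hop decides N."""
    name = min(timeouts_s, key=timeouts_s.get)
    return name, timeouts_s[name]


def heartbeat_interval(timeouts_s: dict[str, float], safety_factor: float = 0.5) -> float:
    """Pick a heartbeat interval well below the tightest timeout.

    The 0.5 safety factor is an assumption, chosen to leave room for jitter.
    """
    _, limit = tightest_hop(timeouts_s)
    return limit * safety_factor


hop, limit = tightest_hop(PATH_TIMEOUTS_S)
interval = heartbeat_interval(PATH_TIMEOUTS_S)
```

With these numbers the tightest hop is the 15-second one, so a heartbeat every 15 seconds would already be too slow on that path.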
Where the keep-alive disappears
The standard, well-documented mitigation is to send a heartbeat: an SSE comment line starting with a colon, sent every 15 to 30 seconds, that pushes a few bytes across the wire so that no proxy sees the connection as idle. The interval has to be shorter than the smallest idle timeout on the path, so a hop that cuts at around 15 seconds needs a faster heartbeat than that. The SSE spec explicitly accommodates this: a line beginning with : is a comment, ignored by the client's parser, but indistinguishable from real traffic to anything inspecting bytes.
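A minimal sketch of the heartbeat on the server side, assuming your events arrive as an async iterator of already-encoded SSE bytes. The name with_heartbeat and the 15-second default are illustrative choices, not part of any framework's API.

```python
import asyncio
from typing import AsyncIterator

HEARTBEAT = b": keep-alive\n\n"  # SSE comment line; client parsers ignore it


async def with_heartbeat(
    events: AsyncIterator[bytes], interval: float = 15.0
) -> AsyncIterator[bytes]:
    """Relay `events`, emitting an SSE comment whenever `interval`
    seconds pass without the source producing anything."""
    it = events.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            # asyncio.wait (unlike wait_for) does not cancel the pending
            # read on timeout, so a slow tool call is never interrupted.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield HEARTBEAT
                continue
            try:
                chunk = pending.result()
            except StopAsyncIteration:
                return  # source finished; end the stream
            yield chunk
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        pending.cancel()  # no-op if the read already completed
```

Wrapping the stream at the last point you control, rather than relying on the provider's own keep-alives, means the heartbeat starts its journey as close to the client as possible.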
The teams that get bitten are not the ones who forgot the heartbeat. They are the ones who assumed the heartbeat survived the trip. It often does not.
Picture the path. The provider sends a : keep-alive comment every 15 seconds. The first thing it hits is your edge: Cloudflare, CloudFront, an AWS ALB, or whatever sits at the front. If that layer, or the Nginx behind it, runs with response buffering enabled (proxy_buffering on is the Nginx default), it accumulates response bytes in a 4-8 KB buffer and only flushes downstream when the buffer is full or the upstream closes. A handful of colon-and-newline bytes will not fill that buffer for a long time. The provider is sending heartbeats. Your client is receiving silence. The next proxy down the line, the one closer to your client, counts the seconds of silence and closes the connection. The heartbeat existed; it was sitting in a buffer two hops upstream when the timeout fired.
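To put a number on "a long time", here is a back-of-the-envelope sketch. It assumes a deliberately simplified proxy that flushes only when its buffer fills and a stream that carries nothing but heartbeats. Real proxies have other flush triggers, so treat the result as an order of magnitude, not a measurement.

```python
import math

HEARTBEAT = b": keep-alive\n\n"  # 14 bytes on the wire


def seconds_until_flush(
    buffer_bytes: int, heartbeat: bytes = HEARTBEAT, interval_s: float = 15.0
) -> float:
    """Time until heartbeats alone fill a buffer of `buffer_bytes`.

    Simplified model: one heartbeat every `interval_s` seconds, and the
    proxy flushes only once the buffer is full.
    """
    beats_needed = math.ceil(buffer_bytes / len(heartbeat))
    return beats_needed * interval_s


four_kb = seconds_until_flush(4096)   # 293 heartbeats -> 4395.0 seconds
eight_kb = seconds_until_flush(8192)  # 586 heartbeats -> 8790.0 seconds
```

Under this model a 4 KB buffer holds more than an hour of heartbeats before anything moves downstream, against idle timeouts measured in tens of seconds.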
This is why the canonical Nginx SSE configuration is six directives, not one. You disable buffering with proxy_buffering off, you disable caching with proxy_cache off, you raise the read timeout with proxy_read_timeout 3600s, you clear the Connection header so Nginx does not send its default Connection: close to the upstream, you force HTTP/1.1 toward the upstream with proxy_http_version 1.1 so the upstream connection can stay open, and you turn off chunked transfer encoding toward the client with chunked_transfer_encoding off. Miss any one and the stream limps. Miss proxy_buffering off specifically and the user sees the whole response land at once at the end, which often gets misdiagnosed as "the model is slow" rather than "the proxy is hoarding."
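Put together, the six directives look roughly like the fragment below. The /stream/ location and the llm_backend upstream name are placeholders, and the 3600s timeout is the value quoted above, not a recommendation for every deployment.

```nginx
# Sketch only: location path and upstream name are hypothetical.
location /stream/ {
    proxy_pass http://llm_backend;      # assumed upstream defined elsewhere

    proxy_buffering off;                # pass bytes to the client as they arrive
    proxy_cache off;                    # never cache an event stream
    proxy_read_timeout 3600s;           # tolerate long gaps between upstream bytes
    proxy_set_header Connection '';     # clear the Connection header sent upstream
    proxy_http_version 1.1;             # speak HTTP/1.1 to the upstream
    chunked_transfer_encoding off;      # no chunked encoding toward the client
}
```

Scoping these to the streaming location, rather than setting them globally, keeps buffering and caching intact for the rest of your traffic.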
The X-Accel-Buffering: no response header is the polite way to tell Nginx that the application knows what it is doing and please do not buffer this particular response, regardless of the global config. Setting it on the server is cheap and survives most proxy reconfigurations. It does nothing against Cloudflare, AWS ALB, or a corporate proxy your client traffic is going through, but it removes one common failure mode.
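As a sketch, the response headers for an SSE endpoint might be assembled like this. The function name is illustrative. The Content-Type and Cache-Control values are the usual SSE companions and are not taken from the text above; only X-Accel-Buffering is.

```python
def sse_response_headers() -> dict[str, str]:
    """Headers for a streaming SSE response behind Nginx."""
    return {
        # Standard media type for Server-Sent Events.
        "Content-Type": "text/event-stream",
        # Commonly set so intermediaries do not cache the stream.
        "Cache-Control": "no-cache",
        # Asks Nginx not to buffer this response, whatever the global config says.
        "X-Accel-Buffering": "no",
    }


headers = sse_response_headers()
```

Most Python web frameworks accept a mapping like this when constructing a streaming response, so the header travels with the application instead of living only in proxy config.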
- https://dev.to/martin_palopoli/how-i-implemented-end-to-end-sse-streaming-from-llm-to-browser-through-nginx-4bjo
- https://oneuptime.com/blog/post/2025-12-16-server-sent-events-nginx/view
- https://smartscope.blog/en/Infrastructure/sse-timeout-mitigation-cloudflare-alb/
- https://www.oliverio.dev/blog/aws-service-connect-sse
- https://community.openai.com/t/api-billing-for-streaming-if-i-close-connection-midway/624323
- https://community.openai.com/t/streaming-interruption-billing-clarification-needed/928978
- https://github.com/ggml-org/llama.cpp/pull/20872
- https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html
- https://zylos.ai/research/2026-03-28-llm-output-streaming-token-delivery-architectures/
