SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later
Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and they live with the consequences for years. The first time the choice bites you is usually in production — a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that passed every unit test and fails silently the moment a client needs to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.
Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
How Each Protocol Actually Fails
Latency benchmarks for SSE, WebSockets, and gRPC streaming are essentially identical at the message rates LLM applications produce. At 50–100 tokens per second, you're nowhere near the throughput ceiling of any of these protocols. What matters instead is the failure-mode profile.
SSE fails at the proxy layer. When an Nginx or Cloudflare proxy sits between your server and client, its default behavior is to buffer the response until its buffer fills or the upstream connection closes. A token stream does neither promptly, so the proxy waits, your client sees nothing, and the first byte arrives after the entire generation is complete — turning streaming into batch delivery. The fix is explicit: set proxy_buffering off in Nginx, add X-Accel-Buffering: no as a response header, and send heartbeat messages (an SSE comment line, :\n\n, or an empty data:\n\n event) every 15–30 seconds to prevent idle timeouts. Cloudflare's default 100-second timeout will inject an HTML error page directly into your event stream if any silence exceeds it. This is a live production failure, not a hypothetical.
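To make the client side of this concrete, here is a minimal sketch (in TypeScript, with names of my choosing) of an SSE chunk parser that ignores heartbeats and flags proxy-injected garbage instead of swallowing it. It handles only the data: and event: fields; a production parser also needs id:, retry:, and buffering of lines split across network chunks.

```typescript
// Minimal SSE frame parser, loosely following the WHATWG EventSource
// processing model. Comment lines (": heartbeat") and empty data: events
// are ignored, so proxy keep-alives never reach application code. Lines
// that are not valid SSE fields — e.g. an HTML error page a proxy injected
// mid-stream — surface as parse errors instead of being silently dropped.

type SSEEvent = { event: string; data: string };

function parseSSEChunk(chunk: string): { events: SSEEvent[]; errors: string[] } {
  const events: SSEEvent[] = [];
  const errors: string[] = [];
  let data: string[] = [];
  let eventType = "message";

  for (const line of chunk.split(/\r?\n/)) {
    if (line === "") {
      // Blank line: dispatch the buffered event, unless the payload is empty
      // (an empty payload is how "data:\n\n" heartbeats come out).
      const payload = data.join("\n");
      if (payload !== "") events.push({ event: eventType, data: payload });
      data = [];
      eventType = "message";
    } else if (line.startsWith(":")) {
      // Comment line — the conventional heartbeat. Ignore.
    } else if (line.startsWith("data:")) {
      data.push(line.slice(5).replace(/^ /, ""));
    } else if (line.startsWith("event:")) {
      eventType = line.slice(6).replace(/^ /, "");
    } else {
      // Not an SSE field: likely proxy-injected garbage (HTML error page).
      errors.push(line);
    }
  }
  return { events, errors };
}
```

The error channel is the point: a parser that just skips unknown lines will quietly eat a Cloudflare 524 error page, and the stream will appear to hang rather than fail loudly.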
WebSockets fail at the backpressure layer. The protocol has no built-in mechanism for a server to detect that a client is consuming messages slower than they're being sent. If a client on a poor mobile connection can only process 1KB/sec while the server pushes at 100KB/sec, the server's TCP send buffer fills, the library's user-space write queue grows, and memory accumulates per-client without bound. A single slow client on a busy server can cause latency spikes visible to every other client sharing the same network interface. Unlike SSE, where the HTTP server's write path surfaces backpressure directly (in Node, for example, res.write returns false when the socket buffer is full), WebSocket implementations push this responsibility entirely to the application. You need to actively monitor socket.bufferedAmount, implement per-client write timeouts, and cap buffer sizes — or you will hit this under load.
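A sketch of what that monitoring can look like on the server, assuming a socket object that exposes bufferedAmount the way both the browser WebSocket API and Node's ws library do. The thresholds and the guardedSend name are illustrative, not tuned values:

```typescript
// Per-client write guard for a WebSocket server. bufferedAmount is the number
// of bytes queued but not yet flushed to the network; it is the only signal
// the WebSocket API gives you about a slow consumer.

interface SocketLike {
  bufferedAmount: number;
  send(data: string): void;
  close(code: number, reason: string): void;
}

const PAUSE_THRESHOLD = 64 * 1024;   // stop writing above 64 KB queued
const KILL_THRESHOLD = 1024 * 1024;  // disconnect above 1 MB queued

function guardedSend(sock: SocketLike, msg: string): "sent" | "paused" | "closed" {
  if (sock.bufferedAmount > KILL_THRESHOLD) {
    // Client is hopelessly behind: cap memory by dropping the connection.
    sock.close(1008, "backpressure limit exceeded");
    return "closed";
  }
  if (sock.bufferedAmount > PAUSE_THRESHOLD) {
    // Caller should pause the token stream and retry after a drain check.
    return "paused";
  }
  sock.send(msg);
  return "sent";
}
```

The "paused" branch is where most implementations cheat: without it, every token generated while the client stalls is another allocation that outlives the stall.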
gRPC streaming fails at the browser boundary. HTTP/2's built-in flow control means backpressure is handled correctly for service-to-service communication. But browsers cannot express bidirectional gRPC streams: the Fetch API buffers the entire request body before sending, and the browser's HTTP/2 implementation doesn't expose the hooks gRPC requires. Client streaming and bidirectional streaming are not supported in any stable browser as of 2026. Unary calls and server streaming work, but if you spec'd out a design requiring clients to send messages upstream during a streaming response, you will hit this wall after your unit tests pass.
The Browser Constraint That Most Teams Hit Too Late
Under HTTP/1.1, browsers limit concurrent connections to any single domain to six. This includes EventSource (SSE) connections. Open a second tab streaming from the same domain, and you've consumed two of those six slots. This is not a browser bug — it's a spec decision that Chrome and Firefox have marked "won't fix" because HTTP/2 makes it irrelevant. Under HTTP/2, multiplexing means you can have effectively unlimited concurrent SSE streams over a single connection.
The practical problem: many teams build behind HTTP/1.1 during development, discover the six-connection ceiling in production when agents run multiple parallel tasks, and then switch to WebSocket to escape it — without realizing HTTP/2 would have solved it. If you're on modern infrastructure (Vercel, Cloudflare, AWS ALB) and serving over HTTPS, you're likely already on HTTP/2. If you're not certain, check.
Neither WebSocket nor gRPC-Web is subject to a corresponding connection limit. But reaching for either just to escape an HTTP/1.1 constraint is solving the wrong problem.
What Edge Proxies Break and How
The typical LLM serving stack has at least two proxy layers: a reverse proxy like Nginx inside the infrastructure boundary, and a CDN or edge layer like Cloudflare in front of it. Both break streaming in their own ways.
Nginx proxies buffer upstream responses by default. For SSE, you need:
proxy_buffering off;
proxy_cache off;
proxy_http_version 1.1;
proxy_set_header Connection '';
proxy_read_timeout 86400s;
The proxy_read_timeout matters more than most people expect. It defines how long Nginx will wait for data from the upstream server before closing the connection. Default is 60 seconds. A slow LLM generation that pauses mid-response for 61 seconds will be silently dropped.
For WebSocket, the upgrade headers must be passed:
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
For gRPC, Nginx has native support but requires its own directives:
grpc_read_timeout 86400s;
grpc_send_timeout 3600s;
Cloudflare imposes its own timeout at 100 seconds for connections that appear idle. A server that holds an HTTP connection open while computing a complex agent task will hit this timeout. The error it returns is raw HTML — which, when inserted into an SSE event stream mid-flight, will corrupt every event parser that expects data: prefixed lines. The mitigation is either heartbeat messages or moving to Cloudflare Workers, which handle SSE natively without the proxy timeout.
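A minimal sketch of the heartbeat mitigation on the server side, assuming a Node-style response object with a write method; the 15-second interval and the helper names are illustrative choices that stay well under both Nginx's 60-second and Cloudflare's 100-second defaults:

```typescript
// Interleave SSE comment-line heartbeats with real token events so no proxy
// ever sees a silent connection. Node's http.ServerResponse satisfies this
// Writable shape.

interface Writable { write(chunk: string): void; }

function startHeartbeat(res: Writable, intervalMs = 15_000): () => void {
  const timer = setInterval(() => {
    res.write(":\n\n"); // SSE comment line — ignored by EventSource clients
  }, intervalMs);
  return () => clearInterval(timer); // call when the generation completes
}

function sendToken(res: Writable, token: string): void {
  // JSON-encode so newlines inside a token cannot break SSE framing.
  res.write(`data: ${JSON.stringify(token)}\n\n`);
}
```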
Enterprise networks compound this. Corporate firewalls often proxy all outbound traffic. A proxy that doesn't recognize chunked transfer encoding without Content-Length may buffer SSE streams in their entirety. Long-lived WebSocket connections may be disconnected by firewalls with idle session limits. Agents connecting from behind corporate proxies will experience connection drops that look like server errors.
The Reconnection Storm Nobody Plans For
WebSocket connections break. Mobile users switch networks, laptops sleep, corporate VPNs reconnect. When connections break, clients reconnect. In a small application with hundreds of users, this is invisible. At scale with tens or hundreds of thousands of concurrent users, a correlated reconnection event — a brief server restart, a DNS hiccup, a CDN blip — can produce what's called a thundering herd.
Each WebSocket reconnection requires a TCP handshake, a TLS negotiation, and an HTTP upgrade round-trip. TLS negotiation is CPU-intensive: a modern server core can handle roughly 2,000 TLS handshakes per second. An 8-core server facing 100,000 simultaneous reconnections therefore needs over six seconds of pure TLS computation (100,000 ÷ (8 × 2,000/s) ≈ 6.25 s) before serving any application traffic. During those six seconds, clients that were connected see degraded latency. New reconnection attempts queue. The effect cascades.
The fix is exponential backoff with jitter on the client side. This is not optional infrastructure polish — it is a correctness requirement for any WebSocket-based system that handles reconnection. SSE benefits from this too (the browser's EventSource has built-in reconnection, but no jitter), but the thundering herd problem is less severe because SSE connections are stateless and cheaper to re-establish.
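A sketch of the client-side policy, using the "full jitter" variant popularized by the AWS Architecture Blog: the delay is drawn uniformly from zero up to an exponentially growing, capped ceiling. The function name and defaults are illustrative:

```typescript
// Full-jitter exponential backoff: delay = random(0, min(cap, base * 2^attempt)).
// The randomness decorrelates clients after a correlated disconnect, spreading
// the reconnection storm across the whole backoff window instead of producing
// synchronized retry waves.

function backoffDelayMs(
  attempt: number,                   // 0 for the first retry
  baseMs = 500,
  capMs = 30_000,
  rand: () => number = Math.random,  // injectable for testing
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return rand() * ceiling;
}
```

A reconnect loop would sleep for backoffDelayMs(attempt) after each failure and reset attempt to zero once a connection survives for some healthy interval.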
gRPC in the Browser: The Current Honest Status
gRPC-Web is not gRPC. It's a translation layer that adapts gRPC to browser-safe HTTP. For unary calls and server streaming, it works. For client streaming and bidirectional streaming, it doesn't work in browsers without an Envoy proxy mediating the connection.
The WHATWG Fetch specification defines a duplex option for requests with streaming bodies. Only 'half' duplex is specified today (Chromium ships it for streaming uploads); a 'full' value, which would let a single fetch send and receive simultaneously and unblock bidirectional streaming, remains a proposal. No stable browser ships full-duplex fetch.
Buf's Connect protocol is the practical alternative for teams that want gRPC semantics in the browser today. It uses standard HTTP rather than gRPC's HTTP/2-specific wire format, making it compatible without Envoy, debuggable in browser DevTools, and capable of supporting server streaming with a significantly smaller client bundle. If your team is building a browser application that needs streaming and already has gRPC services, Connect is worth evaluating before you build an Envoy dependency.
For service-to-service communication with no browser involvement, gRPC streaming is the correct choice. HTTP/2 flow control handles backpressure. Binary Protocol Buffers encoding is efficient. Connection multiplexing means many concurrent logical streams can share a single physical connection. For microservice architectures where latency between services matters, gRPC typically reduces it from hundreds of milliseconds (REST polling) to under 50ms.
How to Actually Choose
The decision tree is shorter than most articles make it seem.
Start with SSE if your server streams tokens to a browser client and the client only needs to receive. This is the 90% case for LLM applications. Configure your proxies explicitly, add heartbeat messages, ensure you're on HTTP/2. SSE's simplicity is an advantage: stateless servers, no session management, works with serverless functions and edge runtimes.
Move to WebSocket when the client needs to send signals during an active generation — cancellation, mid-stream steering, tool call approval, agent handoff coordination. The handshake overhead (~150ms) is the cost, and the connection state management is the operational burden. Implement backpressure monitoring and exponential backoff reconnection before you ship.
Use gRPC streaming for service-to-service streaming in microservice architectures. Don't use gRPC-Web in browsers without Envoy, and evaluate Connect protocol if you need browser clients with bidirectional semantics.
Avoid architectural mismatch. WebSocket to escape SSE's HTTP/1.1 connection limit is probably the wrong fix. gRPC-Web when you need browser bidirectionality will fail in production. The failure you'll encounter is determined by the mismatch between what you assumed the protocol could do and what it actually does in your infrastructure.
What Your Failure Mode Tells You
If users report that LLM responses appear all at once rather than streaming: your proxy is buffering. Fix the proxy configuration before changing the protocol.
If your WebSocket server's memory grows over time under load and restarts solve it temporarily: you have unbounded write buffers accumulating for slow clients. Add buffer caps and per-client write timeouts.
If your SSE streaming works in development but fails behind corporate proxies: add heartbeat messages and investigate whether the proxy requires Content-Length. Some corporate proxies reject chunked encoding entirely.
If you're building an agent workflow where users need to cancel in-flight tool calls: SSE will not work, because you need an upstream channel from client to server. This is the legitimate case for WebSocket. If you're building that same agent workflow and users can wait until the agent proposes an action: SSE works fine with a separate HTTP endpoint for approval/rejection.
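The second pattern (SSE down, plain HTTP up) can be sketched as a small broker that the generation handler awaits and a separate approval endpoint resolves. ApprovalBroker and the endpoint naming are illustrative, not from any framework:

```typescript
// The SSE generation handler parks on waitForVerdict() after streaming the
// agent's proposed action; a separate plain HTTP endpoint (e.g. POST
// /approve/:id) calls submitVerdict() to release it. No upstream channel on
// the stream itself is needed.

type Verdict = "approved" | "rejected";

class ApprovalBroker {
  private pending = new Map<string, (v: Verdict) => void>();

  // Called by the SSE handler when the agent proposes an action.
  waitForVerdict(generationId: string): Promise<Verdict> {
    return new Promise((resolve) => this.pending.set(generationId, resolve));
  }

  // Called by the approval endpoint. Returns false for unknown or
  // already-resolved generations, so double-submits are harmless.
  submitVerdict(generationId: string, v: Verdict): boolean {
    const resolve = this.pending.get(generationId);
    if (!resolve) return false;
    this.pending.delete(generationId);
    resolve(v);
    return true;
  }
}
```

A real deployment would also need a timeout on waitForVerdict and, if the server is horizontally scaled, a shared store (or sticky routing) so the approval request lands on the instance holding the promise.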
The protocol is not the product. It's the load-bearing infrastructure that your product sits on. The teams that make this decision well are the ones who test their entire stack — reverse proxy, CDN, mobile network conditions — before committing. The teams that make it badly discover what their protocol assumes about the network only when their users find the assumption violated.
Conclusion
SSE dominates LLM streaming because it matches what LLM generation actually is: a server pushing tokens to a waiting client, with no upstream messages during the stream. WebSocket's bidirectionality is genuinely necessary only when clients need to interact during generation — which is increasingly common in agent workflows but not universal. gRPC streaming is the right answer for service-to-service communication where typed contracts and flow control matter more than browser compatibility.
The protocol itself rarely causes the production failure. The proxy configuration, the reconnection behavior, the backpressure handling, and the infrastructure assumptions beneath the protocol cause the failure. Pick the simplest protocol that matches your actual communication pattern, then harden the surrounding infrastructure. The specification is the easy part.
