The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller
Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.
This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens is arriving per second, but the tokens are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
Streaming Throughput Is a Metric the Provider Defines
The contract you signed for "tokens per second averaged over the response" sounds like a fixed quantity. It is not. The denominator is fixed, but every other parameter is a knob the provider can turn. Chunk size, burst shape, per-chunk overhead, when the first token leaves the prefill, how aggressively decode catches up after a queue spike — all of these affect the experience without affecting the averaged number.
The cleanest version of the trick is bursting at the head. The provider drains a buffered batch of tokens in the first 200ms of the response, then trickles the long tail at a slower decode rate while still hitting the average. To the consumer the response arrives front-loaded and then seems to stall. The wire metric does not see the stall because the wire metric is averaged across the whole response.
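The head-burst pattern is easy to reproduce with synthetic arrival times. The sketch below is illustrative only: the response size (400 tokens over 8 seconds) and the size of the burst (200 tokens) are assumptions chosen for round numbers, and the function names are hypothetical. Only the 200ms burst window comes from the description above.

```python
def arrival_times_uniform(n_tokens: int, duration_s: float) -> list[float]:
    """Arrival time of each token when tokens are evenly spaced."""
    step = duration_s / n_tokens
    return [step * (i + 1) for i in range(n_tokens)]


def arrival_times_head_burst(
    n_tokens: int, duration_s: float, burst_tokens: int, burst_s: float = 0.2
) -> list[float]:
    """burst_tokens land within burst_s; the rest trickle in until duration_s."""
    burst = [burst_s * (i + 1) / burst_tokens for i in range(burst_tokens)]
    tail_n = n_tokens - burst_tokens
    tail_step = (duration_s - burst_s) / tail_n
    tail = [burst_s + tail_step * (i + 1) for i in range(tail_n)]
    return burst + tail


def average_tps(times: list[float]) -> float:
    """The contract metric: tokens divided by total response time."""
    return len(times) / times[-1]


def tail_tps(times: list[float], after_s: float) -> float:
    """Rate over the part of the response that arrives after `after_s`."""
    late = [t for t in times if t > after_s]
    return len(late) / (times[-1] - after_s)


uniform = arrival_times_uniform(400, 8.0)
bursty = arrival_times_head_burst(400, 8.0, burst_tokens=200)

for name, times in (("uniform", uniform), ("head burst", bursty)):
    print(f"{name}: avg={average_tps(times):.1f} tok/s, "
          f"after 200ms={tail_tps(times, 0.2):.1f} tok/s")
```

Under these assumptions both schedules average about 50 tokens per second, while the head-burst schedule delivers roughly half that rate for everything after the first 200ms. A metric averaged over the whole response cannot tell the two apart.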
A subtler version is reducing the chunk granularity. Yesterday a chunk carried four tokens. Today the same backend change that improved their batching efficiency under load also flattened the SSE event stream into one-token chunks. Your renderer was paying about 2ms of overhead per chunk — parse, mutate state, sanitize markdown, run the highlighter, recompute layout. That cost was invisible at four tokens per chunk and crushing at one token per chunk. The provider's throughput metric did not move; your frontend's frame budget collapsed.
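The arithmetic behind that collapse is short enough to write down. This is a back-of-the-envelope sketch: the 2ms per-chunk overhead is the figure quoted above, while the 100 tokens-per-second stream rate and the function name are assumptions for illustration.

```python
def render_cost(
    tokens_per_s: float, tokens_per_chunk: int, overhead_ms_per_chunk: float = 2.0
) -> tuple[float, float, float]:
    """Return (chunks per second, busy ms per second, busy fraction of the thread).

    Counts only the fixed per-chunk overhead, not layout or paint time.
    """
    chunks_per_s = tokens_per_s / tokens_per_chunk
    busy_ms_per_s = chunks_per_s * overhead_ms_per_chunk
    return chunks_per_s, busy_ms_per_s, busy_ms_per_s / 1000


# Assumed stream rate of 100 tokens/s; same throughput, different chunking.
for tokens_per_chunk in (4, 1):
    chunks, busy_ms, fraction = render_cost(100, tokens_per_chunk)
    print(f"{tokens_per_chunk} tok/chunk: {chunks:.0f} chunks/s, "
          f"{busy_ms:.0f} ms/s busy ({fraction:.0%})")
```

At the assumed rate, four-token chunks cost 50ms of per-chunk overhead every second and one-token chunks cost 200ms, with the tokens-per-second figure unchanged in both cases.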
The headline-metric problem is structural. Anyscale and BentoML both publish guidance treating inter-token latency variance as the property that determines whether streaming "flows" or "stutters." Yet TPS is the number that shows up on the SLA, because it is the number the provider can promise across a heterogeneous fleet. The contract optimizes for what is contractible, not for what is felt.
The Cost Profile Of A Smaller Chunk
A single-token SSE chunk is not just a smaller piece of the same operation. It is its own operation, with its own fixed costs.
On the wire, every chunk carries an HTTP/2 frame header, an SSE event delimiter, and the JSON envelope ({"choices":[{"delta":{"content":"..."}}]}). The envelope is often longer than the token it carries. A four-token chunk amortizes that overhead across four tokens; sending the same four tokens as one-token chunks pays it four times.
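A rough way to see the amortization is to count bytes for the same text chunked two ways. The sketch below is a simplification: it uses only the minimal envelope shown above plus SSE `data:` framing, ignores HTTP/2 framing and any extra fields a real provider includes, and the sample tokens and function names are made up for illustration.

```python
import json


def sse_event_bytes(content: str) -> int:
    """Bytes on the wire for one SSE event carrying `content` in a minimal envelope."""
    envelope = {"choices": [{"delta": {"content": content}}]}
    frame = "data: " + json.dumps(envelope, separators=(",", ":")) + "\n\n"
    return len(frame.encode("utf-8"))


def stream_bytes(tokens: list[str], tokens_per_chunk: int) -> int:
    """Total bytes when `tokens` are grouped into chunks of the given size."""
    total = 0
    for i in range(0, len(tokens), tokens_per_chunk):
        total += sse_event_bytes("".join(tokens[i:i + tokens_per_chunk]))
    return total


tokens = [" the", " quick", " brown", " fox"]  # hypothetical token strings
fixed_cost = sse_event_bytes("")               # envelope and framing, no payload
print("payload bytes:", sum(len(t.encode("utf-8")) for t in tokens))
print("fixed bytes per event:", fixed_cost)
print("one 4-token chunk:", stream_bytes(tokens, 4))
print("four 1-token chunks:", stream_bytes(tokens, 1))
```

The payload is identical in both cases; the difference between the two totals is exactly three extra copies of the fixed per-event cost.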
On the consumer, every chunk triggers the decode-and-render loop. Even a well-built React frontend that batches updates with requestAnimationFrame cannot make the per-chunk parse, sanitize, and markdown re-walk free. Streaming markdown parsers exist precisely because the alternative, re-parsing the full document on every chunk, was already untenable. Going from four-token chunks to one-token chunks quadruples the call rate into that parser without changing what it has to produce.
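What frame batching does and does not save can be shown with a small simulation. This is a language-agnostic sketch written in Python, not frontend code: in a browser the frame boundary would come from requestAnimationFrame, whereas here it is approximated by bucketing arrival timestamps into 60fps frames. The function name and the sample arrival schedule are assumptions.

```python
def coalesce_by_frame(
    events: list[tuple[float, str]], frame_ms: float = 1000 / 60
) -> list[tuple[int, str]]:
    """Group (arrival_ms, text) events into one flush per display frame.

    Returns (frame_index, combined_text) for every frame that received text.
    """
    frames: dict[int, str] = {}
    for arrival_ms, text in events:
        frame = int(arrival_ms // frame_ms)
        frames[frame] = frames.get(frame, "") + text
    return sorted(frames.items())


# Assumed schedule: 100 one-token chunks, one every 10ms.
events = [(i * 10.0, "x") for i in range(100)]
flushes = coalesce_by_frame(events)

print("chunks received (each one still decoded):", len(events))
print("render flushes after batching:", len(flushes))
```

Batching caps the number of renders at the frame rate, but every chunk still has to be read off the socket and decoded before it can be added to the pending text, so the per-chunk share of the work still scales with the chunk count.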
On mobile devices the picture is worse. The same one-token cadence that a desktop browser smooths to 30fps will cause a mid-range Android device to drop frames, because each chunk is doing wakeup, parse, layout, paint — and the paint is the part the user actually sees. The provider's TPS dashboard does not know your fleet has phones on it.
And in the trace store, every chunk is a span. If you log per-chunk SSE events for debugging (and most teams do for the first six months), a fourfold increase in chunks is a fourfold increase in trace storage cost without a corresponding increase in observed information. The line item that quietly quadrupled is the one nobody is looking at.
How The Backend Tuning Sneaks In
There is no announcement. The provider is not trying to degrade your experience. The pattern is mundane: a serving-side change improved throughput at high concurrency by emitting tokens as they cleared decode rather than waiting to coalesce them. The throughput number on their side went up. Their P99 inter-token latency improved, even at 95th-percentile load. They shipped it.
What they did not measure is the consumer-side cost. Their benchmark fleet is a load generator that reads bytes off a socket and discards them. That generator does not care whether the bytes arrive in chunks of one or chunks of four. Yours does, because yours has a renderer.
The same dynamic shows up when providers introduce typed event lifecycles for new content types. When Anthropic added thinking blocks to streaming responses, the stream started carrying an extra content block ahead of the text, with its own start and stop events and its own delta types. The total payload grew; the meaningful tokens did not. Consumers who hand-wrote SSE parsers based on what the wire looked like in 2024 woke up to silent breakage. The provider's SLA was unaffected.
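One defensive habit for hand-written parsers is to dispatch on the declared event and delta type and to skip anything unrecognized, rather than assuming every event carries text. The sketch below is a minimal illustration, not a complete client: the event and field names (content_block_delta, text_delta, thinking_delta) follow Anthropic's streaming format as commonly documented and should be checked against the current API reference, and collect_text is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_text(sse_lines: Iterable[str]) -> str:
    """Accumulate visible text from SSE lines, ignoring unknown event types."""
    parts: list[str] = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # non-JSON sentinels or corrupt data
        if not isinstance(event, dict) or event.get("type") != "content_block_delta":
            continue  # start/stop/ping and any future event types
        delta = event.get("delta")
        if isinstance(delta, dict) and delta.get("type") == "text_delta":
            parts.append(delta.get("text", ""))
        # Other delta types (for example thinking_delta) are skipped here.
    return "".join(parts)


sample = [
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":"hmm"}}',
    '',
    'data: {"type":"content_block_delta","index":1,"delta":{"type":"text_delta","text":"Hello"}}',
    'data: {"type":"some_future_event"}',
    'data: {"type":"content_block_delta","index":1,"delta":{"type":"text_delta","text":" world"}}',
]
print(collect_text(sample))
```

A parser written this way degrades by ignoring what it does not understand, which is usually a safer failure than rendering reasoning text as output or crashing on an unexpected shape.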
The lesson is not that providers are adversarial. The lesson is that the provider's optimization surface and the consumer's perception surface are different surfaces, and the metric in the contract sits on the provider's side. Anything that moves on the consumer's side and does not show up on the provider's dashboard is invisible to the people in a position to change it.
A Perceptual SLO The Consumer Owns
References
- https://www.spheron.network/blog/llm-inference-slo-ttft-itl-latency-budget-guide-2026/
- https://zylos.ai/research/2026-03-28-llm-output-streaming-token-delivery-architectures/
- https://techcommunity.microsoft.com/blog/azuredevcommunityblog/the-importance-of-streaming-for-llm-powered-chat-applications/4459574
- https://developer.chrome.com/docs/ai/render-llm-responses
- https://www.sitepoint.com/streaming-backends-react-controlling-re-render-chaos/
- https://medium.com/@QuarkAndCode/llm-streaming-latency-cut-ttft-smooth-tokens-fix-cold-starts-f2be60d26b89
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://benchlm.ai/llm-speed
- https://artificialanalysis.ai/leaderboards/models
