The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens is arriving per second, but they are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
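To make the arithmetic concrete, here is a minimal sketch with invented numbers. It assumes a rate of 40 tokens per second and a one-second window, neither of which comes from any real provider. The `Chunk`, `make_stream`, and `summarize` names are hypothetical helpers for illustration only. Both synthetic streams report the same tokens-per-second figure, while the chunk cadence the renderer sees is four times busier.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    arrival_ms: int  # arrival time relative to stream start
    tokens: int      # number of tokens carried by this chunk


def make_stream(tokens_per_chunk: int, interval_ms: int, duration_ms: int) -> list[Chunk]:
    """Build a synthetic stream: one chunk every interval_ms for duration_ms."""
    return [Chunk(t, tokens_per_chunk) for t in range(0, duration_ms, interval_ms)]


def summarize(chunks: list[Chunk], duration_ms: int) -> dict[str, float]:
    """Compute the wire-level rate plus the two numbers that rate hides."""
    total_tokens = sum(c.tokens for c in chunks)
    seconds = duration_ms / 1000
    return {
        "tokens_per_second": total_tokens / seconds,
        "chunks_per_second": len(chunks) / seconds,
        "mean_tokens_per_chunk": total_tokens / len(chunks),
    }


# Assumed numbers: 40 tokens/s over a 1-second window (illustrative only).
# "before": four-token chunks every 100 ms; "after": one-token chunks every 25 ms.
before = summarize(make_stream(4, 100, 1000), 1000)
after = summarize(make_stream(1, 25, 1000), 1000)

print(before)
print(after)
```

A dashboard that tracks only `tokens_per_second` cannot tell these two streams apart. One that also tracks `chunks_per_second` or `mean_tokens_per_chunk` would have shown the change the moment it shipped.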