The AI Feature With Two Latencies: You Measure One, Your Users Feel the Other

9 min read
Tian Pan
Software Engineer

A traditional HTTP request has one latency that matters: the time from request to response. The p95 of that number is the contract. SRE watches it, the SLO is written against it, and when it regresses someone gets paged. One number, one dashboard, one truth.

A streaming AI feature broke that model the moment the response became a stream, and most teams haven't noticed. There are now two latencies, and they diverge. Time-to-first-token is how long the user stares at a spinner before anything happens. Time-to-completion is how long until the answer is fully written. They are shaped by different forces, fixed by different levers, and felt by the user at completely different emotional weights — and almost every team instruments only the second one, because that's the number the HTTP framework hands them for free.

The result is a dashboard that is technically accurate and experientially blind. Your p95 looks healthy. Your users are watching a four-second spinner. Both things are true at once, and the gap between them is the most common unmeasured failure mode in production AI features today.

Why The Framework Lies To You By Default

Open any web framework's request-tracing middleware. It timestamps when the request arrives and when the response finishes, and it reports the difference. For a JSON API that's exactly right. For a streaming endpoint it is the duration of the entire stream — which is to say, time-to-completion. The framework cannot see the moment the first token left the server, because as far as the transport layer is concerned the response is one long-lived connection that either succeeded or didn't.

So the default instrumentation measures the latency the user cares about least. Once the answer starts rendering, the user is reading; they have stopped checking whether the feature is broken. The agonizing part of the wait — the part where they wonder if they should refresh the page — happens entirely before the first token, and that interval is invisible on the standard dashboard.

This creates a specific and dangerous shape of metric. Time-to-completion is dominated by long answers. A request that produces 1,200 tokens of correct, useful output takes several seconds to finish, and that's fine: the user has been happily reading the whole time. But that multi-second completion time sits in the same histogram as the genuinely bad experience, a four-second stall before a single token appears. The p95 of the combined number is pushed up by long-but-fine answers, while a stalled request with a short answer can still finish under the threshold, so the genuinely broken experience disappears into the distribution.

You can have a p95 time-to-completion of three seconds that contains zero unhappy users, and a p95 of three seconds that contains nothing but unhappy users. The number cannot tell you which. That's not a precision problem you fix with more decimal places. It's a sign you are measuring the wrong quantity.
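
To make that concrete, here is a minimal sketch with invented numbers. Both fleets are hypothetical, and the sample values are assumptions chosen only to illustrate the point: the two fleets share an identical completion-time distribution but have very different first-token latency.

```python
import statistics

def p95(values):
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(values, n=100)[94]

# Hypothetical samples as (time_to_first_token_s, time_to_completion_s).
# Fleet A: first token at 0.3s, then a long answer the user reads as it streams.
fleet_a = [(0.3, 1.0 + 0.02 * i) for i in range(100)]
# Fleet B: identical completion times, but almost all of each wait is silence.
fleet_b = [(done - 0.2, done) for _, done in fleet_a]

for name, fleet in (("A", fleet_a), ("B", fleet_b)):
    ttft_p95 = p95([ttft for ttft, _ in fleet])
    done_p95 = p95([done for _, done in fleet])
    print(f"fleet {name}: completion p95 = {done_p95:.2f}s, "
          f"first-token p95 = {ttft_p95:.2f}s")
```

A completion-only dashboard would print the same p95 for both fleets. Only the first-token column separates them.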

Two Latencies, Two Different Physics

The reason you cannot derive one latency from the other is that they are produced by different machinery.

Time-to-first-token is gated by the prefill phase: the model has to ingest the entire prompt — system instructions, retrieved context, conversation history, the user's message — and compute attention over all of it before it can emit token one. Prefill cost scales with input size. A feature that stuffs 30,000 tokens of retrieved documents into context will have a slow first token no matter how fast the model generates afterward. Network round-trips, cold model routing, and queueing in front of the inference server all land in this bucket too.

Time-to-completion adds the decode phase on top: tokens are generated one at a time, each one cheap, and the total is roughly output length divided by generation speed. Completion cost scales with output size. A concise answer finishes fast; a verbose one finishes slow, and that has almost nothing to do with how snappy the feature felt.
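
As a rough sketch of that decomposition, the function below treats time-to-first-token as queueing plus prefill, and completion as that plus output length divided by generation speed. The function name, its parameters, and every number are assumptions for illustration; real systems overlap and jitter in ways this additive model ignores.

```python
def estimate_latencies(queue_s, prefill_s, output_tokens, tokens_per_s):
    """Return (time_to_first_token_s, time_to_completion_s) under a simple additive model."""
    ttft_s = queue_s + prefill_s                          # everything before token one
    completion_s = ttft_s + output_tokens / tokens_per_s  # decode added on top
    return ttft_s, completion_s

# Hypothetical numbers: same prompt and queue, different answer lengths.
concise = estimate_latencies(queue_s=0.1, prefill_s=0.3, output_tokens=100, tokens_per_s=200)
verbose = estimate_latencies(queue_s=0.1, prefill_s=0.3, output_tokens=1200, tokens_per_s=200)
# Hypothetical heavy-context request: short answer, but a large prompt to prefill.
heavy_prompt = estimate_latencies(queue_s=0.1, prefill_s=2.5, output_tokens=100, tokens_per_s=200)

for label, (ttft_s, completion_s) in (("concise", concise),
                                      ("verbose", verbose),
                                      ("heavy prompt", heavy_prompt)):
    print(f"{label}: first token {ttft_s:.1f}s, completion {completion_s:.1f}s")
```

In this toy model the verbose answer has the worst completion time yet the same first-token time as the concise one, while the heavy-prompt request finishes sooner than the verbose one and still makes the user wait longest for any sign of life.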

These pull in opposite directions under load. Batching more requests onto a GPU improves aggregate throughput, which is good for completion time across the fleet, while lengthening the queue every request waits in before prefill, which hurts time-to-first-token. Optimize blindly for one and you can silently regress the other. A team that watches only completion time can ship a batching change that makes the dashboard look better and the product feel worse, and nothing in their monitoring will contradict them.

Human-perception research gives the thresholds that make this concrete. First-token latency under about 500ms feels responsive; past a second users notice; past two seconds frustration sets in. Inter-token speed has its own threshold — fast enough to outpace reading is enough; faster is invisible. These are different budgets for different metrics, and a single combined SLO cannot encode them.

The Patience Budget Is Not One Number

The deeper reason to split the metric is that the user does not have one patience budget. They have two, and they are wildly different sizes.

The patience budget for "is this thing even working" is tiny — a second or two. The patience budget for "is it finished yet" is large, because once tokens are streaming the user has feedback, can start reading, and is engaged in active rather than passive waiting. This is the same psychology that makes a skeleton screen feel ~30% faster than a spinner over an identical delay: visible progress converts dead time into tolerable time. Streaming itself is the biggest version of that trick. It does not make the answer arrive sooner; it makes the first sign of life arrive sooner, and it deliberately trades a slightly worse completion time for a dramatically better perceived experience.

That trade-off only makes sense if you measure both sides of it. A 10-second answer that starts rendering in 400ms feels fast. A 10-second answer that appears all at once after 10 seconds of silence feels broken. Same completion time. Opposite products. If your only metric is completion time, those two experiences are literally identical on your dashboard — and you have no way to know that you have shipped the second one.

So the SLOs have to be split. Time-to-first-token gets an aggressive target — sub-second if you can — because it is policing the small patience budget. Time-to-completion gets a looser target tied to answer length, because it is policing the large one. Some teams add a third, inter-token latency, to catch the stream that starts fast and then stutters. The point is not the exact numbers; it is that one threshold cannot serve two budgets.

What Splitting The Metric Actually Requires

Fixing this is mostly instrumentation discipline, and it is cheaper than it sounds.

Timestamp the first token explicitly. Do not infer latency from the request span your framework gives you. In the code path that consumes the model's stream, record the wall-clock moment the first chunk arrives and emit it as its own metric. This is a few lines of code, and it is the single highest-leverage change in this entire post. Everything else is downstream of having the number.
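
Those few lines might look something like the sketch below. It assumes the model client hands you an iterable of text chunks and that you have some metrics callback. `timed_stream`, `emit`, `fake_model_stream`, and the metric names are all hypothetical, not part of any particular SDK. It uses a monotonic clock rather than wall-clock time, since only the elapsed duration matters here.

```python
import time
from typing import Callable, Iterable, Iterator

def timed_stream(
    chunks: Iterable[str],
    emit: Callable[[str, float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass chunks through unchanged, emitting first-token and completion timings."""
    start = clock()  # taken when the wrapper is created, before the stream is consumed

    def generate() -> Iterator[str]:
        first_seen = False
        for chunk in chunks:
            if not first_seen:
                emit("ttft_seconds", clock() - start)  # the number the framework never sees
                first_seen = True
            yield chunk
        emit("completion_seconds", clock() - start)    # only reached if the stream finishes

    return generate()

def fake_model_stream() -> Iterator[str]:
    # Stand-in for a real streaming model call.
    for piece in ("Hello", ", ", "world"):
        time.sleep(0.01)
        yield piece

text = "".join(timed_stream(fake_model_stream(), lambda name, value: print(name, round(value, 3))))
```

Ideally `start` would be the moment the request reached your server rather than the moment the wrapper was built, so that queueing ahead of the model call is counted too. Note that a client that disconnects mid-stream never triggers the completion metric in this sketch, which is probably what you want for a latency histogram but worth tracking separately.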

Track percentiles, never averages, and track them per metric. A system with a 200ms average first-token time can carry a p99 of three seconds — 1% of users waiting fifteen times longer than the average implies. Streaming UX is an outlier game; the tail is the experience. Keep p50, p95, and p99 for time-to-first-token and for completion time separately. The two histograms have different shapes and must never be merged.
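
A small sketch of how an average hides that tail, using made-up samples (the 985/15 split and both latency values are assumptions picked so the mean lands near 200ms):

```python
import statistics

# Hypothetical first-token latencies in seconds: 985 fast requests, 15 stalled ones.
ttft_samples = [0.16] * 985 + [3.0] * 15

# quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
cuts = statistics.quantiles(ttft_samples, n=100)
mean_s = statistics.mean(ttft_samples)
p50_s, p95_s, p99_s = cuts[49], cuts[94], cuts[98]

print(f"mean = {mean_s:.3f}s  p50 = {p50_s:.2f}s  p95 = {p95_s:.2f}s  p99 = {p99_s:.2f}s")
```

In this toy data set even the p95 looks healthy, and only the p99 shows the stall, which is one reason to keep all three percentiles rather than picking one.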

Consider a goodput-style definition of success. Instead of one threshold, count a request as good only if it satisfies all of its constraints at once — first token under X, inter-token under Y, completion under Z. The fraction of requests that clear every bar is a single honest number that no long-but-fine answer can inflate.
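
One way to sketch that definition is below. The `RequestTiming` and `Slo` types, their field names, and the threshold values are all illustrative assumptions. It also uses a fixed completion bound for simplicity, where a real SLO might scale that bound with answer length.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    ttft_s: float             # time to first token
    max_inter_token_s: float  # worst gap between consecutive tokens
    completion_s: float       # time to last token

@dataclass
class Slo:
    ttft_s: float
    inter_token_s: float
    completion_s: float

def is_good(timing: RequestTiming, slo: Slo) -> bool:
    # A request counts only if it clears every bar at once.
    return (timing.ttft_s <= slo.ttft_s
            and timing.max_inter_token_s <= slo.inter_token_s
            and timing.completion_s <= slo.completion_s)

def goodput(timings: list[RequestTiming], slo: Slo) -> float:
    if not timings:
        raise ValueError("goodput is undefined for an empty sample")
    return sum(is_good(t, slo) for t in timings) / len(timings)

# Hypothetical thresholds and requests.
slo = Slo(ttft_s=1.0, inter_token_s=0.2, completion_s=15.0)
timings = [
    RequestTiming(0.4, 0.05, 10.0),  # long but fine
    RequestTiming(0.3, 0.04, 1.5),   # short and fine
    RequestTiming(4.0, 0.05, 4.5),   # stalled before the first token
    RequestTiming(0.4, 1.50, 6.0),   # started fast, then stuttered
]
print(f"goodput = {goodput(timings, slo):.0%}")
```

The stalled request here completes faster than the long-but-fine one, so a completion-only threshold would rank it as the better experience. The combined check does not.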

Tune for the metric you can now see. Once first-token latency is visible, a whole class of optimizations becomes worth doing: caching the prompt prefix so the static instructions and examples are not recomputed during prefill on every request, which can cut first-token latency dramatically; keeping a smaller, faster model on the first hop; warming routes so requests don't hit a cold path; and trimming retrieved context that inflates prefill for marginal quality gain. None of these would ever justify themselves on a completion-time dashboard, because completion time barely moves when you fix them.

The Org Seam Underneath The Metric

The reason this gap persists is organizational, not technical. SRE owns the latency dashboard, and that dashboard was designed in the pre-streaming era when latency was a scalar. The AI team owns the perceived experience but often does not own the monitoring stack, so the number that matters to them never gets a panel. Nobody is wrong; the two halves of the truth simply live in two different orgs and never get brought together in the same view.

The fix is to make "latency" a banned word in AI feature reviews unless it is qualified. Not "what's our latency" but "what's our time-to-first-token p95 and our completion p95." The moment the question forces two numbers, the team that was hiding a four-second spinner behind a healthy aggregate has nowhere left to hide.

Latency stopped being a scalar the moment your response became a stream. A team that still reports a single number is reporting the average of two experiences its users never average — and quietly optimizing the one that matters least.
