
Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.

"First Token" Is Three Different Numbers Now

Before reasoning models, "first token" had one meaning: the first character of the model's response. The provider boundary, the streaming boundary, and the user-visible boundary all coincided. You could instrument any of them and get the same answer.

Reasoning models split that single number into three:

  • First emitted token — the first chunk the model produces, which may be an internal reasoning token never shown to the user. Some providers expose these as a separate stream; others fold them into the same SSE channel as the answer, distinguished by a field on the chunk.
  • First visible token — the first character that actually appears in the user's chat window. On a non-streaming UI, this is the moment the response card flips from loading to populated. On a streaming UI, it is the first character your render pipeline writes to the DOM.
  • First useful token — the first character of substantive answer content, after preamble like "Let me think about this." or "I'll check three things." Many teams ignore this distinction, but it matters for tasks where the preamble is a placeholder and the user is waiting for the actual answer.
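The three definitions above can be sketched as three timestamps taken from one stream. This is a minimal illustration, not any provider's API: the chunk shape `(arrival_ms, kind, text)`, the `measure` function, and the `PREAMBLES` list are all hypothetical, and the "visible" mark here is only a server-side proxy (the first renderable chunk), since the real first visible token is recorded at render time in the client.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical chunk shape: (arrival_ms, kind, text), kind is "reasoning" or "answer".
# Real provider payloads differ; adapt the field access to yours.
Chunk = Tuple[float, str, str]

# Assumed preamble phrases, borrowed from the examples in the text.
PREAMBLES = ("Let me think about this.", "I'll check three things.")


@dataclass
class FirstTokenTimes:
    emitted_ms: Optional[float] = None  # first chunk of any kind
    visible_ms: Optional[float] = None  # first renderable answer chunk (proxy)
    useful_ms: Optional[float] = None   # first answer text past any preamble


def strip_preambles(text: str) -> str:
    """Drop known preamble phrases from the front of the accumulated answer."""
    rest = text.lstrip()
    stripped = True
    while stripped:
        stripped = False
        for phrase in PREAMBLES:
            if rest.startswith(phrase):
                rest = rest[len(phrase):].lstrip()
                stripped = True
    return rest


def measure(chunks: Iterable[Chunk], sent_ms: float) -> FirstTokenTimes:
    times = FirstTokenTimes()
    answer = ""
    for arrival_ms, kind, text in chunks:
        elapsed = arrival_ms - sent_ms
        if times.emitted_ms is None:
            times.emitted_ms = elapsed
        if kind != "answer" or not text:
            continue  # reasoning chunks are never rendered
        if times.visible_ms is None:
            times.visible_ms = elapsed
        answer += text
        rest = strip_preambles(answer)
        # Text that could still grow into a preamble phrase is not yet "useful".
        still_preamble = any(p.startswith(rest) for p in PREAMBLES)
        if times.useful_ms is None and rest and not still_preamble:
            times.useful_ms = elapsed
    return times


stream = [
    (100.0, "reasoning", "hmm"),
    (5000.0, "answer", "Let me think"),
    (5200.0, "answer", " about this."),
    (6000.0, "answer", " The answer is 4."),
]
times = measure(stream, sent_ms=0.0)
print(times)
```

In this made-up stream the three numbers are 100 ms, 5000 ms, and 6000 ms for the same request. The preamble check is a deliberately crude heuristic; the point is only that each definition needs its own timestamp.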

A latency SLO promises a user experience. Whichever of these three numbers you instrumented at the time you wrote the SLO is the number the SLO is actually about. If you wrote "p99 first token under 800 ms" against a non-reasoning model, you implicitly defined first token as user-visible — because that was the only meaning available. Then a reasoning model arrived and changed which number your dashboard was reading. The SLO text did not change. The promise it made did.
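To make the drift concrete, here is a small sketch that evaluates one SLO threshold against two definitions of "first token" for the same requests. The request data is synthetic and the field names `emitted_ms` and `visible_ms` are hypothetical; the percentile uses the nearest-rank method, which may differ from what your metrics backend computes.

```python
SLO_MS = 800


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, so no floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic traffic: the provider's first chunk arrives quickly, but the
# first user-visible character arrives seconds later.
requests = [
    {"emitted_ms": 300 + 3 * i, "visible_ms": 4000 + 30 * i}
    for i in range(100)
]

p99_emitted = percentile([r["emitted_ms"] for r in requests], 99)
p99_visible = percentile([r["visible_ms"] for r in requests], 99)

print("p99 first emitted token:", p99_emitted, "ms, SLO met:", p99_emitted <= SLO_MS)
print("p99 first visible token:", p99_visible, "ms, SLO met:", p99_visible <= SLO_MS)
```

The same hundred requests pass the SLO under one definition and miss it by several seconds under the other, with no change to the SLO text.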

The Provider Boundary Is the Wrong Place to Measure UX

When teams set up LLM observability, the easiest instrumentation point is the SDK's streaming interface. The OpenAI and Anthropic clients both hand you the response as a stream of chunks or events, so timestamping the first one takes a line of code. Most observability vendors auto-instrument this point, label the metric gen_ai.server.time_to_first_token, and call it done.

This is the right metric for an infrastructure question — how fast is the provider responding? — and the wrong metric for a product question — how fast does our app feel? The two questions used to have the same answer. They no longer do, and the gap is widening.

Three reasons the provider boundary is the wrong UX measurement layer:

  1. The first chunk may be a reasoning token. Reasoning APIs typically emit thinking content as part of the same stream, tagged with a content type. Naive instrumentation timestamps the first chunk regardless of whether you render it. A reasoning model can comfortably spend 4–15 seconds emitting hidden thoughts before the first answer token. Your dashboard records that as a normal request because the first chunk arrived on time.
  2. Your render pipeline adds latency the API never sees. Streaming chunks pass through SSE buffers, optional reverse proxies, your frontend's chunk handler, React's state batching, and the browser's paint loop. The gap between "chunk received by your edge" and "character visible to the user" is small but non-zero — 50 to 300 ms in typical setups, longer if you batch updates or if a service worker is in the path.
  3. Preamble improves the metric without improving the experience. If you prompt the model to emit "Let me think..." before reasoning, the first visible token now arrives in milliseconds, and your SLO looks great. But the first useful token is still seconds away. A user who hits send and reads "Let me think..." for eight seconds feels exactly the same as a user who stares at a blank screen for eight seconds. Your SLO says you fixed it.

The instrumentation point determines what kind of promise you can make. SDK callbacks promise that the network worked. Edge timers promise that your infrastructure worked. Render-time markers promise that the user got something. Only the last one is a UX SLO.
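One way to picture the three instrumentation points is a per-request record with one mark per layer, each checked against the same threshold. This is a sketch under assumptions: `LayerTimings`, `evaluate`, and the field names are hypothetical, the numbers are invented, and the render mark is assumed to be reported back by the client (for example in a telemetry beacon), so it can be missing.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SLO_MS = 800.0


@dataclass
class LayerTimings:
    # Each value is a duration in milliseconds from the start of the request,
    # measured on the clock of the layer that recorded it.
    sdk_first_chunk_ms: float              # SDK: first chunk of any kind
    edge_first_byte_ms: float              # edge: first byte forwarded to the client
    render_first_char_ms: Optional[float]  # client: first character rendered


# What a passing check at each layer lets you claim.
PROMISES = {
    "sdk": "the provider responded",
    "edge": "our infrastructure forwarded the response",
    "render": "the user saw something",
}


def evaluate(t: LayerTimings, slo_ms: float = SLO_MS) -> Dict[str, Optional[bool]]:
    """Return pass/fail per layer; None means the client never reported."""
    render_ok = None
    if t.render_first_char_ms is not None:
        render_ok = t.render_first_char_ms <= slo_ms
    return {
        "sdk": t.sdk_first_chunk_ms <= slo_ms,
        "edge": t.edge_first_byte_ms <= slo_ms,
        "render": render_ok,
    }


timings = LayerTimings(
    sdk_first_chunk_ms=450.0,
    edge_first_byte_ms=470.0,
    render_first_char_ms=5380.0,
)
result = evaluate(timings)
for layer, ok in result.items():
    print(f"{layer}: {ok} ({PROMISES[layer]})")
```

With these invented numbers the SDK and edge checks pass while the render check fails, which is the green-dashboard situation described earlier. Treating a missing render mark as `None` rather than a pass keeps silent clients from inflating the number.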

How Roadmaps Quietly Break the Definition

The team I described did not change their SLO. They did not change their instrumentation. They changed which model they routed traffic to, and that change reached into a definition they did not realize they had made.

Several common roadmap moves trigger the same drift:
