The AI Feature With Two Latencies: You Measure One, Your Users Feel the Other
A traditional HTTP request has one latency that matters: the time from request to response. The p95 of that number is the contract. SRE watches it, the SLO is written against it, and when it regresses, someone gets paged. One number, one dashboard, one truth.
Streaming AI features broke that model the moment the response stopped arriving all at once, and most teams haven't noticed. There are now two latencies, and they diverge. Time-to-first-token is how long the user stares at a spinner before anything appears. Time-to-completion is how long until the answer is fully written. They are shaped by different forces, fixed by different levers, and felt by the user at very different emotional weights. Yet almost every team instruments only the second one, because that's the number the HTTP framework hands them for free.
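To make the two numbers concrete, here is a minimal Python sketch of capturing both from a single stream. It assumes the response is exposed as an iterable of text chunks; `StreamTimings` and `timed_stream` are hypothetical names for illustration, not part of any framework or model SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTimings:
    """Both latencies for one streamed response, in seconds."""
    time_to_first_token: Optional[float] = None  # stays None if no chunk ever arrives
    time_to_completion: Optional[float] = None   # stays None until the stream is exhausted


def timed_stream(
    chunks: Iterable[str],
    timings: StreamTimings,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass chunks through unchanged while recording both latencies.

    Assumption: the clock starts when the caller first pulls from this
    wrapper, so start consuming it as soon as the request is issued.
    """
    start = clock()
    for chunk in chunks:
        if timings.time_to_first_token is None:
            # First chunk: this is the moment the spinner can go away.
            timings.time_to_first_token = clock() - start
        yield chunk
    # Stream exhausted: this is the number a request timer usually reports.
    timings.time_to_completion = clock() - start
```

Wrapping the stream at the point where chunks are forwarded to the client yields both values per request, and each can then be emitted as its own histogram. The `clock` parameter exists only so the timing logic can be exercised deterministically in tests.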
