The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token
The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back — somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the last token streamed and the first audio frame is real, it just is not on the LLM team's chart.
The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.
