The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token

June 2, 2026 · 10 min read

Software Engineer

The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token (TTFT) at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back: somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the first token streamed and the first audio frame is real; it just is not on the LLM team's chart.

The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.

Time-to-First-Token Is a Component Metric, Not a Product Metric

TTFT became the default voice-agent latency SLI not because it is the right metric but because it is the easy one. Every LLM provider publishes a TTFT number. Every observability vendor's LLM integration captures it out of the box. The number even has a credible threshold attached: under 400 ms feels responsive, under 250 ms feels instant. If you build an SLO around what your provider already measures, you ship a dashboard in a week.

The problem is that TTFT measures the moment the first token leaves the provider's API. It does not include the network hop back to your orchestrator, the buffering decision your stream parser makes before handing text to TTS, the TTS engine's own time-to-first-audio (TTFA), the WebSocket or HTTP transport from TTS back to your media server, the jitter buffer in the media pipeline, or the playout delay on the user's device. On a clean US-East deployment with a fast TTS provider, the gap between TTFT and what the user actually hears is somewhere between 200 and 500 ms. On a typical multi-vendor pipeline with cross-region hops, it routinely exceeds 600 ms.
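To make the gap concrete, the segments listed above can simply be added up. The sketch below is illustrative only: the segment names and every value except the 280 ms TTFT from the opening scenario are assumed placeholders chosen for the example, not measurements, and ASR endpointing is left out because the scenario starts the clock at the end of the user's sentence.

```python
# Illustrative only: every value except the 280 ms TTFT is an assumed placeholder.
SEGMENTS_MS = {
    "llm_ttft": 280,                 # the number the provider dashboard reports
    "network_to_orchestrator": 20,   # hop from provider back to your orchestrator
    "stream_parser_buffering": 40,   # text held back before it is handed to TTS
    "tts_ttfa": 150,                 # TTS time-to-first-audio
    "tts_to_media_transport": 30,    # WebSocket/HTTP hop from TTS to media server
    "jitter_buffer": 80,             # media pipeline jitter buffer
    "device_playout": 30,            # playout delay on the user's device
}


def perceived_latency_ms(segments: dict[str, int]) -> int:
    """Sum every segment between the end of the utterance and audible playout."""
    return sum(segments.values())


def unmeasured_gap_ms(segments: dict[str, int], measured: str = "llm_ttft") -> int:
    """Latency the user feels that the single measured component does not report."""
    return perceived_latency_ms(segments) - segments[measured]


if __name__ == "__main__":
    print("dashboard shows:", SEGMENTS_MS["llm_ttft"], "ms")
    print("user hears after:", perceived_latency_ms(SEGMENTS_MS), "ms")
    print("off the chart:", unmeasured_gap_ms(SEGMENTS_MS), "ms")
```

With these assumed values the dashboard number stays at 280 ms while the total lands above the 600 ms spec, which is the situation described at the top of the article.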

The recent independent TTS benchmarks make the magnitude concrete. Cartesia Sonic 3.5 streams its first audio frame around 75–90 ms after receiving text. ElevenLabs Flash v2.5 lands near the same range on its real-time path. The Coval benchmark from May 2026 puts P50 TTFA at 188 ms for Sonic-3 and 264 ms for ElevenLabs Turbo v2.5. Those are best-case egress numbers, measured at the TTS provider's boundary, and they are already comparable to a fast LLM TTFT on their own. By the time you add transport, jitter buffer, and codec framing, the bottom of the stack is not where the SLO assumed it was.

Each Layer Optimizes Its Own Metric and the Sum Owns the User

A voice agent is a relay race. ASR catches the user's speech and decides when the turn ended. The endpointer hands off to the LLM, which streams tokens. The TTS engine converts tokens to audio. The media server frames audio into RTP packets. The user's device buffers, decodes, and plays. Each handoff has an owner, and each owner has a dashboard.

The failure mode is uniform: every owner optimizes their own segment, the dashboards all turn green, and nobody owns the sum. The ASR team tunes endpoint detection down to a 200 ms silence threshold and books a win. The LLM team swaps to a smaller reasoning tier, drops TTFT to 200 ms, and books a win. The TTS team migrates to a streaming provider and brings TTFA down to 100 ms. The media team adopts a tighter jitter buffer. Each change is real. But the SLO was never expressed as the sum of these segments — it was a single component number on the LLM dashboard — so nobody noticed that the actual mouth-to-ear latency went from 1100 ms to 950 ms when it should have gone to 500 ms. The wins did not compose because there was no budget that allocated them.

This is the same architectural shape as a financial metric defined at the wrong currency conversion step. A revenue number reported in USD by a team operating in EUR is technically a number; it is just a number that has slipped off the contract with the business. A latency number reported at the LLM egress by a team selling a voice product is the same kind of mistake. The unit of measurement does not match the unit the user experiences.

The SLO Has to Live at the User-Perceptual Boundary

The fix is not to add more component dashboards. It is to redefine the SLO at the layer the user actually perceives, then derive component budgets from it.

Practitioners who have done this well converge on a small number of patterns:

Define the SLI as mouth-to-ear: the elapsed time from the user's end-of-utterance (the moment ASR's endpointer fires) to the first audio frame reaching the user's audio output. The NIST mission-critical voice methodology calls this the same thing and has been measuring it since long before LLMs entered the stack. Borrow the name. It is unambiguous.
Run synthetic conversation probes that measure the whole pipeline: a scheduled job that places a real call (SIP, WebRTC, whatever your production transport is), plays a pre-recorded utterance with a known acoustic marker at its tail, and records the agent's response audio. The latency is the offset between the marker in the played audio and the first non-silent frame in the recorded audio. This is the only measurement that includes every layer, including the ones your component dashboards omit by construction. SignalWire's recent piece on provider latency claims is direct about this: anything not measured at the audio loopback is a number that excludes whichever layer the vendor wants to exclude.
Allocate a per-segment budget that adds up to the SLO: if the SLO is 800 ms mouth-to-ear at P95, the budget might be 150 ms for ASR endpoint detection, 350 ms for LLM TTFT, 150 ms for TTS TTFA, 50 ms for transport, and 100 ms for jitter buffer and playout. Every component dashboard now reports both its raw number and its consumption of its budget. A regression in any single component is a regression against the user-facing SLO before it ships, not after.
Contract with the TTS layer on TTFA, not TTFB or "first chunk": TTS providers publish a confusing variety of latency numbers. There is time-to-first-byte (the first byte of the response, which may be a header), time-to-first-chunk (which may be a small framing chunk before any actual audio), and time-to-first-audio (the first decodable audio frame). Only the last one is meaningful for a voice agent. If your TTS vendor publishes TTFB and you accept it as the latency contract, you are accepting an SLI that excludes the actual audio generation latency. Push back on the definition.
Treat the jitter buffer as part of the SLO, not as transport plumbing: a 100 ms jitter buffer is a 100 ms latency cost paid on every turn. If audio quality is good enough that you could shave 40 ms off the buffer, that 40 ms shows up directly in user-perceived latency. The media team will fight this because their SLO is audio quality; resolve the conflict by giving the buffer a budget under the latency SLO and the quality SLO simultaneously, and let them argue at design time instead of in production.
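The budget allocation pattern above can be sketched as a small table plus a consistency check. This is a minimal sketch, not a definitive implementation: the 800 ms SLO and the per-segment numbers are the example figures from the list, while the segment keys and function names are hypothetical.

```python
SLO_MS = 800  # mouth-to-ear at P95, the example SLO from the text

# Example allocation from the text; the key names are hypothetical.
BUDGET_MS = {
    "asr_endpointing": 150,
    "llm_ttft": 350,
    "tts_ttfa": 150,
    "transport": 50,
    "jitter_and_playout": 100,
}


def validate_budget(budget: dict[str, int], slo_ms: int) -> None:
    """Fail loudly if the per-segment budgets do not add up to the SLO."""
    total = sum(budget.values())
    if total != slo_ms:
        raise ValueError(f"budgets sum to {total} ms, SLO is {slo_ms} ms")


def budget_consumption(observed_ms: dict[str, float],
                       budget: dict[str, int]) -> dict[str, float]:
    """Fraction of its budget each segment consumed (1.0 means exactly on budget)."""
    return {name: observed_ms[name] / limit for name, limit in budget.items()}


def over_budget(observed_ms: dict[str, float],
                budget: dict[str, int]) -> list[str]:
    """Segments whose observed latency exceeds their allocation."""
    return [name for name, limit in budget.items() if observed_ms[name] > limit]


validate_budget(BUDGET_MS, SLO_MS)
```

Reporting consumption as a fraction lets every component dashboard show its raw number next to its share of the user-facing SLO, so a segment at 1.5 is visibly a problem even when its raw latency looks small.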

What This Looks Like Once Instrumented

The synthetic probe is the load-bearing piece. Without it, every other measurement is a partial story. A reasonable implementation runs once a minute against each production region, places a call through the same transport the real users use, plays a fixed prompt (a 2-second utterance with a 100 Hz sine tone at its very end as the timing marker), and listens for the agent's response audio. The probe records mouth-to-ear latency, captures the per-component timings via trace correlation, and emits both the SLI value and the per-segment consumption.
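The measurement at the heart of the probe can be sketched in a few lines. This is a minimal sketch under loud assumptions: the played and recorded tracks are lists of float samples that share one clock (sample 0 is the same instant in both), the marker is detected with a Goertzel filter, and the response is detected with a simple RMS threshold. The function names, the 20 ms frame size, and both thresholds are hypothetical choices, and placing the call and capturing audio are out of scope here.

```python
import math


def frames(samples, frame_len):
    """Yield consecutive, non-overlapping full frames."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


def tone_power(frame, freq_hz, sample_rate):
    """Goertzel power at freq_hz, scaled so a pure sine of amplitude A gives about A**2."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in frame:
        s_prev, s_prev2 = x + coeff * s_prev - s_prev2, s_prev
    power = s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
    return power / (len(frame) / 2.0) ** 2


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def mouth_to_ear_ms(played, recorded, sample_rate, marker_hz=100.0,
                    frame_ms=20, tone_threshold=0.05, silence_rms=0.01):
    """Offset from the end of the marker tone to the first non-silent recorded frame.

    Returns None if no response audio is found after the marker.
    """
    frame_len = sample_rate * frame_ms // 1000

    # The marker ends at the frame after the last one that contains the tone.
    marker_end = None
    for i, frame in enumerate(frames(played, frame_len)):
        if tone_power(frame, marker_hz, sample_rate) > tone_threshold:
            marker_end = i + 1
    if marker_end is None:
        raise ValueError("marker tone not found in played audio")

    # The response starts at the first non-silent recorded frame after the marker.
    for i, frame in enumerate(frames(recorded, frame_len)):
        if i >= marker_end and rms(frame) > silence_rms:
            return (i - marker_end) * frame_ms
    return None
```

The result is quantized to the frame size, so 20 ms frames give 20 ms resolution. A production probe would also need to handle echo of the prompt in the recorded track and background noise, which this sketch ignores.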

A good observability platform — LiveKit, Hamming, or any of the tracing-first voice platforms now appearing — will let you tag the synthetic call so it shows up as a distinguishable cohort in dashboards. The real production calls remain the ground truth; the synthetic probe is the SLO oracle. When the production cohort and the synthetic cohort diverge, you have learned something specific about traffic patterns. When they move together, the synthetic probe is doing its job.
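One way to watch for the two cohorts drifting apart is to compare their P95 values. The sketch below is an assumption-heavy illustration: it supposes you already have per-call mouth-to-ear values for both cohorts (for production calls these would likely be trace-derived estimates), and the 100 ms tolerance and the function names are hypothetical.

```python
def p95(values_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not values_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cohorts_diverge(synthetic_ms, production_ms, tolerance_ms=100):
    """True when the P95 of the two cohorts differs by more than the tolerance."""
    return abs(p95(production_ms) - p95(synthetic_ms)) > tolerance_ms
```

A divergence flag is a prompt to investigate traffic patterns, not an SLO breach by itself; the SLO is still judged on the synthetic probe.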

The per-segment trace correlation matters because the moment the SLO is breached, the question is not "is the system slow" — that is what triggered the alert. The question is "which segment's budget did we blow." If the answer is TTS TTFA, the on-call page goes to the voice-engine team. If it is jitter buffer, it goes to the media team. If it is LLM TTFT, it goes to the model team. The synthetic probe gives the SLO; the trace correlation gives the runbook.
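The routing step can be sketched as a lookup from the worst over-budget segment to its owner. This is a sketch under stated assumptions: the budget is the 800 ms example allocation from earlier, the segment keys are hypothetical, and the owners for ASR endpointing and transport are placeholders because the text only names the voice-engine, media, and model teams.

```python
# Example allocation from the text; key names are hypothetical.
BUDGET_MS = {
    "asr_endpointing": 150,
    "llm_ttft": 350,
    "tts_ttfa": 150,
    "transport": 50,
    "jitter_and_playout": 100,
}
SLO_MS = sum(BUDGET_MS.values())  # 800 ms mouth-to-ear

OWNER = {
    "asr_endpointing": "speech team",    # placeholder owner, not named in the text
    "llm_ttft": "model team",
    "tts_ttfa": "voice-engine team",
    "transport": "media team",           # placeholder owner, not named in the text
    "jitter_and_playout": "media team",
}


def overruns_ms(trace_ms):
    """Map each over-budget segment to how far it overran its allocation."""
    return {seg: trace_ms[seg] - limit
            for seg, limit in BUDGET_MS.items() if trace_ms[seg] > limit}


def page_target(trace_ms):
    """Return (segment, owner) for the worst overrun, or None if the SLO held."""
    total = sum(trace_ms[seg] for seg in BUDGET_MS)
    if total <= SLO_MS:
        return None
    over = overruns_ms(trace_ms)  # non-empty: budgets sum to the SLO
    worst = max(over, key=over.get)
    return worst, OWNER[worst]
```

For example, a probe trace with `tts_ttfa` at 320 ms and everything else near budget pages the voice-engine team, even if the LLM segment is also slightly over.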

The Architectural Realization

Voice agents are the clearest example of a class of bug that recurs everywhere AI gets stitched into a pipeline: an SLO defined at the wrong layer of the stack passes while the product feels broken. The same mistake shows up in retrieval pipelines where the embedding model's latency is the SLO and the reranker's tail is what users feel. It shows up in agentic systems where the LLM TTFT is the SLO and the tool call is where the seconds live. It shows up in evaluation where the model accuracy is the SLO and the post-processing is where the regressions hide.

The pattern is consistent. A component publishes a metric. The team adopts it as the contract because it is easy to wire up. The metric stays inside SLO while the user-facing experience drifts. Eventually a customer complaint, or an executive sitting on a demo call, surfaces the gap, and the team learns it has been measuring the wrong thing for a quarter or more.

The defense is not to instrument more components. It is to define the SLO at the boundary the user actually crosses, and to derive component budgets from it rather than calling one component's metric a product SLO. For voice agents specifically, that boundary is the user's ear. The SLI is time-to-first-audio, measured by a probe that listens. The TTFT number on the LLM dashboard is a useful component diagnostic, but it is not the SLO, and any team that ships a voice product on a TTFT SLA is signing a contract with a unit that does not match what the user pays for.

