
Voice AI in Production: Engineering the 300ms Latency Budget

10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.

Why 300ms Is the Number

Human conversation has a natural response latency of 200-300ms. This is the gap between when one person finishes speaking and the other begins — fast enough to feel like genuine dialogue rather than turn-taking. Research consistently shows that pauses above ~400ms are perceptible, and pauses beyond 1.5 seconds fundamentally change the user's mental model from "conversation" to "query-response." Once that happens, no amount of voice quality improvements will fix the experience.

The problem is that building a voice AI pipeline requires chaining three components — speech-to-text (STT), a language model (LLM), and text-to-speech (TTS) — and each adds latency. A naive sequential implementation looks like this:

  1. Wait for user to finish speaking
  2. Send complete audio to STT → wait for transcript (150-500ms)
  3. Send transcript to LLM → wait for complete response (350ms-1s+)
  4. Send complete response text to TTS → wait for audio (100-500ms)
  5. Begin playing audio

Add those up and you're at 600ms to 2 seconds before the user hears a single word. This architecture is fine for a prototype. It's not a production voice system.
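As a quick sanity check on that arithmetic, here is a minimal sketch that sums the per-stage ranges from the list above. The dictionary keys and the function name are illustrative, and the LLM upper bound is taken as 1000ms even though the list says "1s+", so the real worst case can be higher.

```python
# Stage latency ranges in milliseconds, taken from the numbered list above.
SEQUENTIAL_STAGES_MS = {
    "stt_transcript": (150, 500),
    "llm_full_response": (350, 1000),  # "1s+" in the text, so 1000 is a floor
    "tts_audio": (100, 500),
}

def sequential_latency_ms(stages):
    """Sum best-case and worst-case latency when stages run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

best, worst = sequential_latency_ms(SEQUENTIAL_STAGES_MS)
print(f"best case: {best}ms, worst case: {worst}ms")
```

Even the best case is double the 300ms target, which is why no amount of per-stage tuning rescues the sequential design.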

The Streaming Architecture Is Not Optional

The single biggest latency optimization in voice AI is not model selection, hardware, or network routing. It's eliminating sequential waiting by streaming across all three stages simultaneously.

In a streaming architecture, each stage starts before the previous one finishes:

  • Streaming STT consumes audio in small chunks (typically 20ms) and emits partial transcripts while the user is still speaking. The LLM can start processing the partial transcript before the final word is spoken.
  • Streaming LLM sends tokens to TTS as they arrive, rather than waiting for the complete response.
  • Streaming TTS begins synthesizing audio from the first sentence fragment and streams audio chunks to the client while the LLM is still generating later paragraphs.

In practice, streaming STT saves 100-200ms, streaming TTS saves 200-400ms, and the overlap between LLM generation and TTS synthesis saves additional time on top of that. Combined, a well-implemented streaming architecture cuts 300-600ms from total end-to-end latency compared to batch processing.
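The LLM-to-TTS overlap can be sketched with chained async generators. This is a toy under stated assumptions, not a real integration: `llm_tokens`, `sentence_fragments`, `tts_audio`, and `respond` are hypothetical names, the LLM reply is canned, and the "audio" is just a tagged string. What it does show is the shape of the pipeline: TTS receives its first fragment after a single token, not after the full response.

```python
import asyncio

async def llm_tokens(prompt):
    """Simulated streaming LLM: yields canned response tokens one at a time."""
    for token in ["Sure,", "your", "order", "shipped.", "Anything", "else?"]:
        await asyncio.sleep(0)  # stand-in for network and generation delay
        yield token

async def sentence_fragments(tokens):
    """Group tokens into fragments at punctuation so TTS can start early."""
    buffer = []
    async for token in tokens:
        buffer.append(token)
        if token.endswith((",", ".", "?", "!")):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the LLM stops
        yield " ".join(buffer)

async def tts_audio(fragments):
    """Simulated streaming TTS: yields one fake audio chunk per fragment."""
    async for fragment in fragments:
        await asyncio.sleep(0)  # stand-in for synthesis time
        yield f"<audio:{fragment}>"

async def respond(transcript):
    """Chain LLM -> fragmenter -> TTS; return audio chunks in play order."""
    chunks = []
    async for chunk in tts_audio(sentence_fragments(llm_tokens(transcript))):
        chunks.append(chunk)  # a real client would start playback here
    return chunks

chunks = asyncio.run(respond("where is my order"))
print(chunks)
```

Splitting on punctuation is one simple choice of fragment boundary. Smaller fragments start audio sooner but give the TTS less context for prosody, so the boundary rule is worth tuning per voice.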

This is why a budget breakdown is more useful than a single number. The 300ms target allocates roughly:

Stage                      Budget
STT finalization           50-100ms
LLM time-to-first-token    100-200ms
TTS time-to-first-byte     50-80ms
Transport (WebRTC)         20-50ms

If any stage blows its budget, the user notices. The LLM is almost always the constraint — it's the hardest to optimize and sits in the middle of the chain.
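One way to make the budget actionable is to compare measured per-stage latencies against it. The sketch below copies the ranges from the table above; the stage keys, the `over_budget` helper, and the sample measurements are all hypothetical. Note that the upper bounds sum to 430ms, so the stages cannot all sit at the top of their ranges and still hit 300ms.

```python
# Per-stage budgets in milliseconds, copied from the table above.
BUDGET_MS = {
    "stt_finalization": (50, 100),
    "llm_ttft": (100, 200),
    "tts_ttfb": (50, 80),
    "transport": (20, 50),
}

def over_budget(measured_ms, budget=BUDGET_MS):
    """Return {stage: overshoot_ms} for stages above their upper bound."""
    return {
        stage: measured_ms[stage] - high
        for stage, (_, high) in budget.items()
        if measured_ms.get(stage, 0) > high
    }

# Made-up measurements for illustration only.
sample = {"stt_finalization": 80, "llm_ttft": 260, "tts_ttfb": 70, "transport": 30}
print(over_budget(sample))
print(sum(high for _, high in BUDGET_MS.values()))
```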

Model Selection Is Constrained by Latency

Most teams initially pick LLMs based on quality benchmarks. For voice, you need to reframe this: the primary constraint is time-to-first-token (TTFT), not throughput or benchmark scores.

The spread across providers is large. In current production systems, fast models like Groq-hosted Llama variants achieve 50-100ms TTFT. GPT-4o mini runs 120-200ms. Gemini 1.5 Flash lands around 300ms. GPT-4o sits at roughly 700ms. Frontier reasoning models (o3, Claude with extended thinking) are completely off the table for live voice loops; their latency is measured in seconds, not milliseconds.
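To make the constraint concrete, the sketch below filters those quoted figures against a 200ms TTFT ceiling, the upper end of the LLM slot in a 300ms budget. The numbers are the approximate ones from the paragraph above, not fresh measurements, and `fits_budget` is an illustrative helper.

```python
# Approximate TTFT ranges in milliseconds, as quoted in the paragraph above.
TTFT_MS = {
    "Groq-hosted Llama": (50, 100),
    "GPT-4o mini": (120, 200),
    "Gemini 1.5 Flash": (300, 300),
    "GPT-4o": (700, 700),
}

LLM_TTFT_BUDGET_MS = 200  # upper bound of the LLM slot in a 300ms budget

def fits_budget(ttft_ms, budget_ms=LLM_TTFT_BUDGET_MS):
    """Return models whose worst quoted TTFT stays within the budget."""
    return [name for name, (_, high) in ttft_ms.items() if high <= budget_ms]

print(fits_budget(TTFT_MS))
```

In a real evaluation you would replace the table with TTFT percentiles measured from your own region and prompt sizes, since provider latency shifts over time.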

This creates a real capability tradeoff. The models fast enough for voice tend to be smaller and less capable at complex reasoning, tool use, or nuanced dialogue. There's no free lunch: deploying a slower, more capable model requires other architectural compensations (aggressive streaming, hedging, constrained response length) to keep the perceived latency acceptable.

One pattern worth considering is LLM hedging: run two LLM calls in parallel and use whichever returns a usable first token first. Hedging does little for average-case latency and costs a second request, but it cuts P99 latency significantly, and P99 is exactly what determines whether users describe the experience as "sometimes laggy."
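A minimal hedging sketch with `asyncio` might look like the following. `first_token` is a hypothetical stand-in for a real streaming LLM call, the replica names and delays are invented, and the sketch treats any first result as usable, where a production version would also validate the token and handle a call that fails first.

```python
import asyncio

async def first_token(name, delay_s):
    """Hypothetical stand-in for an LLM call; returns once a first token arrives."""
    await asyncio.sleep(delay_s)
    return name, "Hello"

async def hedged_first_token(calls):
    """Run all calls concurrently, return the first result, cancel the rest."""
    tasks = [asyncio.create_task(call) for call in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let cancellations settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()

winner, token = asyncio.run(
    hedged_first_token([first_token("replica_a", 0.05), first_token("replica_b", 0.01)])
)
print(winner, token)
```

Cancelling the losing call promptly matters for cost: the sooner it is cancelled, the fewer tokens you pay for on the request you throw away.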

The Turn Detection Problem Nobody Tells You About

Streaming architecture solves one latency source. Turn detection is a separate problem that teams frequently underestimate.

Voice Activity Detection (VAD) — detecting whether audio contains speech or silence — is the naive approach to deciding when a user has finished speaking. It works, but it introduces a latency floor: VAD typically requires 600ms of silence before triggering a response. On a 300ms budget, you've already used twice your budget before the LLM has seen a single character.

More subtly, VAD-only systems are plagued by false positives. Mid-sentence pauses, hesitations, and trailing vowels all look like end-of-turn silence to a VAD. The agent starts responding while the user is still forming a thought, which feels worse than being slow: it feels broken.
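Both failure modes show up in even a toy silence-based endpointer. The sketch below assumes 20ms frames and a made-up energy threshold; only the 600ms silence window comes from the text. Real VADs use trained models, not a bare energy cutoff, so treat this as an illustration of the timing logic only.

```python
FRAME_MS = 20            # assumed frame size
SILENCE_MS = 600         # silence required before end of turn, per the text
ENERGY_THRESHOLD = 0.1   # assumed, illustrative only

def end_of_turn_ms(frame_energies):
    """Return the time (ms) at which silence-based VAD fires, or None."""
    silent_ms = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= ENERGY_THRESHOLD:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_MS:
                return (index + 1) * FRAME_MS
    return None

# 200ms of speech, then silence: the turn is declared over at 800ms.
finished = [0.5] * 10 + [0.0] * 30
# Same audio, but the user resumes speaking after the 600ms pause.
thinking_pause = finished + [0.5] * 5

print(end_of_turn_ms(finished))
print(end_of_turn_ms(thinking_pause))
```

The two inputs produce the same answer because the detector fires before it can see that the speaker continued. Shortening `SILENCE_MS` lowers the latency floor but makes these false positives more frequent.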

The production answer is semantic turn detection: a lightweight model that combines acoustic signals (pitch drop, energy drop, typical for question completion) with lexical signals (sentence-ending tokens, question markers) to determine conversational completeness. Semantic endpointing can bring true end-of-turn detection under 300ms while reducing false interruptions by ~45% compared to VAD-only approaches.
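To illustrate the idea of blending the two signal types, here is a deliberately simple hand-weighted scorer. Everything in it is an assumption made for the example: the cue word list, the equal weights, the 0.6 threshold, and the premise that acoustic features arrive already normalized to the range 0 to 1. A production system would use a trained model, not fixed rules.

```python
# Hypothetical cue list and threshold, chosen only for illustration.
TRAILING_WORDS = {"and", "but", "so", "um", "uh", "because", "or"}
THRESHOLD = 0.6

def lexical_score(text):
    """1.0 for terminal punctuation, 0.0 for a dangling word, else 0.5."""
    stripped = text.strip().lower()
    if not stripped:
        return 0.5
    if stripped.endswith((".", "?", "!")):
        return 1.0
    if stripped.split()[-1] in TRAILING_WORDS:
        return 0.0
    return 0.5

def turn_complete(pitch_drop, energy_drop, text):
    """Blend acoustic cues (each 0..1) with the lexical cue; True if likely done."""
    acoustic = (pitch_drop + energy_drop) / 2
    score = 0.5 * acoustic + 0.5 * lexical_score(text)
    return score >= THRESHOLD

print(turn_complete(0.8, 0.9, "What time do you open?"))
print(turn_complete(0.2, 0.9, "I want to book a flight and"))
```

The second call is the case a VAD-only system gets wrong: the energy has dropped, but the dangling "and" and the flat pitch suggest the user is not finished.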
