Voice AI in Production: Engineering the 300ms Latency Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.

Why 300ms Is the Number

Human conversation has a natural response latency of 200-300ms. This is the gap between when one person finishes speaking and the other begins — fast enough to feel like genuine dialogue rather than turn-taking. Research consistently shows that pauses above ~400ms are perceptible, and pauses beyond 1.5 seconds fundamentally change the user's mental model from "conversation" to "query-response." Once that happens, no amount of voice quality improvements will fix the experience.

The problem is that building a voice AI pipeline requires chaining three components — speech-to-text (STT), a language model (LLM), and text-to-speech (TTS) — and each adds latency. A naive sequential implementation looks like this:

  1. Wait for user to finish speaking
  2. Send complete audio to STT → wait for transcript (150-500ms)
  3. Send transcript to LLM → wait for complete response (350ms-1s+)
  4. Send complete response text to TTS → wait for audio (100-500ms)
  5. Begin playing audio

Add those up and you're at 600ms to 2 seconds before the user hears a single word. This architecture is fine for a prototype. It's not a production voice system.
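A toy sketch of the sequential shape makes the cost concrete (the stage latencies here are illustrative `time.sleep` stand-ins, not real provider numbers):

```python
import time

def stt(audio: bytes) -> str:
    time.sleep(0.3)            # simulates 150-500ms of STT latency
    return "book a table for two"

def llm(transcript: str) -> str:
    time.sleep(0.5)            # simulates 350ms-1s+ for a full response
    return "Sure, for what time?"

def tts(text: str) -> bytes:
    time.sleep(0.2)            # simulates 100-500ms of synthesis
    return b"\x00" * 1024      # placeholder audio

start = time.monotonic()
audio_out = tts(llm(stt(b"...user audio...")))
elapsed_ms = (time.monotonic() - start) * 1000
# Every stage's latency adds up before the user hears the first byte.
```

With these simulated numbers the user waits a full second of dead air, and every stage you slow down pushes that out linearly.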

The Streaming Architecture Is Not Optional

The single biggest latency optimization in voice AI is not model selection, hardware, or network routing. It's eliminating sequential waiting by streaming across all three stages simultaneously.

In a streaming architecture, each stage starts before the previous one finishes:

  • Streaming STT emits partial transcripts while the user is still speaking, typically in 20ms audio chunks. The LLM starts processing the partial transcript long before the final word is spoken.
  • Streaming LLM sends tokens to TTS as they arrive, rather than waiting for the complete response.
  • Streaming TTS begins synthesizing audio from the first sentence fragment and streams audio chunks to the client while the LLM is still generating later paragraphs.

In practice, streaming STT saves 100-200ms, streaming TTS saves 200-400ms, and the overlap between LLM generation and TTS synthesis saves additional time on top of that. Combined, a well-implemented streaming architecture cuts 300-600ms from total end-to-end latency compared to batch processing.
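A minimal generator sketch of the overlap, with an invented token stream standing in for a real LLM and byte strings standing in for synthesized audio:

```python
from typing import Iterator

def streaming_llm(transcript: str) -> Iterator[str]:
    # Yields tokens as they are generated instead of one final string.
    for token in ["Sure,", " I", " can", " help", " with", " that."]:
        yield token

def streaming_tts(tokens: Iterator[str]) -> Iterator[bytes]:
    # Synthesizes per fragment: the first chunk goes out while later
    # tokens are still being generated upstream.
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.endswith((".", "!", "?", ",")):
            yield buffer.encode()   # placeholder for synthesized audio
            buffer = ""
    if buffer:
        yield buffer.encode()

first_chunk = next(streaming_tts(streaming_llm("partial transcript")))
# The client can begin playback from this first chunk immediately,
# long before the LLM has finished the full response.
```

The point of the sketch is the shape, not the code: no stage ever holds a complete result before the next stage starts consuming it.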

This is why a budget breakdown is more useful than a single number. The 300ms target allocates roughly:

  Stage                      Budget
  STT finalization           50-100ms
  LLM time-to-first-token    100-200ms
  TTS time-to-first-byte     50-80ms
  Transport (WebRTC)         20-50ms

If any stage blows its budget, the user notices. The LLM is almost always the constraint — it's the hardest to optimize and sits in the middle of the chain.
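One way to operationalize the breakdown is a budget check against measured per-stage latencies; the numbers mirror the allocation above, and the stage names are illustrative:

```python
# Per-stage budgets in ms: (low end, high end).
BUDGET_MS = {
    "stt_finalization": (50, 100),
    "llm_ttft": (100, 200),
    "tts_ttfb": (50, 80),
    "transport_webrtc": (20, 50),
}

best = sum(lo for lo, _ in BUDGET_MS.values())   # 220 ms end-to-end
worst = sum(hi for _, hi in BUDGET_MS.values())  # 430 ms end-to-end

def over_budget(measured_ms: dict) -> list:
    """Return the stages whose measured latency exceeds the high end."""
    return [s for s, ms in measured_ms.items() if ms > BUDGET_MS[s][1]]

offenders = over_budget({"llm_ttft": 350, "tts_ttfb": 60})
# offenders == ["llm_ttft"] -- the usual suspect
```

Wiring a check like this into production telemetry makes "which stage blew its budget" a query rather than an investigation.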

Model Selection Is Constrained by Latency

Most teams initially pick LLMs based on quality benchmarks. For voice, you need to reframe this: the primary constraint is time-to-first-token (TTFT), not throughput or benchmark scores.

The spread across providers is large. In current production systems, fast models like Groq-hosted Llama variants achieve 50-100ms TTFT. gpt-4o-mini runs 120-200ms. Gemini 1.5 Flash lands around 300ms. GPT-4o sits at roughly 700ms. Frontier reasoning models — o3, Claude with extended thinking — are completely off the table for live voice loops; their latency is measured in seconds, not milliseconds.

This creates a real capability tradeoff. The models fast enough for voice tend to be smaller and less capable at complex reasoning, tool use, or nuanced dialogue. There's no free lunch: deploying a slower, more capable model requires other architectural compensations (aggressive streaming, hedging, constrained response length) to keep the perceived latency acceptable.

One pattern worth considering: LLM hedging, which runs two LLM calls in parallel and uses whichever returns a usable first token first. This cuts P99 latency significantly without improving average-case performance, but P99 is exactly what determines whether users describe the experience as "sometimes laggy."
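A hedging sketch with `asyncio`, using invented model names and simulated time-to-first-token delays:

```python
import asyncio

async def call_model(name: str, ttft_ms: int) -> str:
    # Stand-in for a streaming LLM call; ttft_ms simulates time-to-first-token.
    await asyncio.sleep(ttft_ms / 1000)
    return f"first token from {name}"

async def hedged_first_token() -> str:
    # Race two identical requests; take whichever answers first and
    # cancel the straggler so it doesn't keep burning tokens.
    primary = asyncio.create_task(call_model("primary", 400))
    backup = asyncio.create_task(call_model("backup", 120))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result()

result = asyncio.run(hedged_first_token())
```

The cost is paying for two inference calls per turn; whether that trade is worth it depends on how bad your provider's tail latency is.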

The Turn Detection Problem Nobody Tells You About

Streaming architecture solves one latency source. Turn detection is a separate problem that teams frequently underestimate.

Voice Activity Detection (VAD) — detecting whether audio contains speech or silence — is the naive approach to deciding when a user has finished speaking. It works, but it introduces a latency floor: VAD typically requires 600ms of silence before triggering a response. On a 300ms budget, you've already used twice your budget before the LLM has seen a single character.

More subtly, VAD-only systems are plagued by false positives. Mid-sentence pauses, hesitations, and trailing vowels all look like silence to a VAD detector. The agent starts responding while the user is still forming a thought, which feels worse than being slow — it feels broken.

The production answer is semantic turn detection: a lightweight model that combines acoustic signals (falling pitch and energy at the end of a phrase) with lexical signals (sentence-final tokens, question markers) to determine conversational completeness. Semantic endpointing can bring true end-of-turn detection under 300ms while reducing false interruptions by ~45% compared to VAD-only approaches.
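A toy endpointing heuristic shows how the signals combine; the thresholds and cue lists here are illustrative, and a production endpointer would be a trained model rather than hand-written rules:

```python
def is_turn_complete(text: str, silence_ms: int, pitch_dropped: bool) -> bool:
    # Combine lexical and acoustic cues instead of trusting silence alone.
    stripped = text.rstrip()
    lexical_done = stripped.endswith((".", "?", "!"))
    trailing_filler = stripped.lower().endswith(("um", "uh", "so", "and"))
    if trailing_filler:
        return False                # mid-thought: keep listening
    if lexical_done and pitch_dropped:
        return silence_ms >= 150    # confident: respond quickly
    return silence_ms >= 600        # fall back to a VAD-style timeout

is_turn_complete("What time do you open?", 200, True)   # complete
is_turn_complete("I was thinking, um", 500, False)      # still forming
```

Even this crude version captures the key asymmetry: when both signal families agree, the agent can respond in a fraction of the VAD silence window.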

Properly handling the inverse — barge-in detection (user interrupting the agent mid-response) — requires bidirectional plumbing. When a user interrupts, the system needs to cancel buffered TTS audio, clear the LLM generation queue, and restart the STT pipeline within tens of milliseconds. Systems that handle this correctly feel conversational. Systems that don't are the ones users describe as "not listening."
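The teardown sequence can be sketched as plain state management (the class and its fields are invented for illustration):

```python
import queue

class AgentTurn:
    """Sketch of the per-turn state that a barge-in must tear down."""
    def __init__(self) -> None:
        self.tts_buffer = queue.Queue()   # audio chunks awaiting playback
        self.llm_cancelled = False
        self.stt_listening = False

    def on_barge_in(self) -> None:
        # 1. Drop buffered TTS audio so playback stops immediately.
        while not self.tts_buffer.empty():
            self.tts_buffer.get_nowait()
        # 2. Signal the LLM stream to stop generating.
        self.llm_cancelled = True
        # 3. Re-arm STT so the interruption is transcribed from its start.
        self.stt_listening = True

turn = AgentTurn()
turn.tts_buffer.put(b"chunk-1")
turn.tts_buffer.put(b"chunk-2")
turn.on_barge_in()
```

The hard part in a real system is not the logic but doing all three steps within tens of milliseconds across process and network boundaries.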

Transport Is Not a Free Variable

Teams optimizing STT, LLM, and TTS latency often forget that the audio has to travel across a network. The transport layer is not free.

WebRTC is the correct choice for browser and mobile applications, adding roughly 20-50ms of transport latency. Traditional PSTN (phone calls over carrier infrastructure) adds 150-700ms of network transit, mostly from carrier switch routing. That's a 300ms+ penalty for phone-based deployments that no amount of model optimization will recover.

Geographic co-location matters significantly. If your users are in Australia and your LLM provider's nearest datacenter is in Virginia, you're adding 200-300ms of round-trip latency before the model has processed a token. For international deployments, regional model inference or aggressive caching of common responses can compensate, but the fundamental constraint is the speed of light.

One underappreciated detail: every new WebSocket connection pays for TCP and TLS handshakes before the first byte moves. Voice systems should hold one persistent connection open for the duration of a call, not open a connection per request. The Opus audio codec, configured for low algorithmic delay, is the standard choice for voice pipelines: it adds minimal coding delay and degrades gracefully on packet loss.

Why "Voice Feels Weird" After You've Solved the Latency

Here's the failure mode that comes after shipping: latency is under 300ms, the transcription is accurate, the TTS voice sounds good in isolation — and users still report something feels off.

The issue is almost always prosody. Latency measures when audio starts playing; prosody determines how natural that audio sounds. Human speech has rhythm, stress, pitch variation, and intonation patterns that convey meaning beyond the words. LLM-generated text tends to be declarative and evenly structured in ways that don't translate naturally to spoken audio.

TTS systems have improved dramatically, but they're synthesizing text, not modeling how a person would naturally speak a given response in a conversational context. The result is responses that sound like a voice reading a document rather than a person talking.

Some approaches that have shown practical benefit:

  • Disfluencies: Prompting the LLM to occasionally include filler words ("um," "let me check on that") makes responses sound more natural and buys time for external API calls without feeling like a stall.
  • Conciseness constraints: Long LLM responses that were written for text don't work for voice. Short declarative sentences with natural conversational structure ("Got it. Looking that up now — one second.") fare better than paragraphs.
  • Punctuation as prosody signal: TTS systems use punctuation as a proxy for speech rhythm. Structured prompts that produce well-punctuated output consistently produce better-sounding audio than responses without explicit punctuation guidance.
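Punctuation-driven flushing also falls out of the streaming design: buffer streamed tokens and hand the TTS a fragment at each sentence boundary, so the punctuation the LLM emits directly shapes the rhythm of the audio. A sketch with an invented token stream:

```python
import re
from typing import Iterator

def flush_at_punctuation(tokens: Iterator[str]) -> Iterator[str]:
    # Accumulate streamed tokens and emit a TTS-ready fragment at each
    # sentence boundary; punctuation doubles as the prosody signal.
    buffer = ""
    for token in tokens:
        buffer += token
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

fragments = list(
    flush_at_punctuation(iter(["Got it.", " Looking", " that up now."]))
)
# fragments == ["Got it.", "Looking that up now."]
```

Flushing at sentence boundaries rather than fixed token counts gives the TTS complete prosodic units to work with, which is most of why well-punctuated output sounds better.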

External API Calls: The Latency Wild Card

Most production voice agents need to call external systems — databases, CRMs, scheduling APIs — during a conversation. This is where latency budgets get unpredictable.

The wrong pattern: issue the API call synchronously in the critical path, block LLM response generation until the call returns, and hope the call completes in under 100ms. In practice, external calls range from 50ms to 500ms+ and have high variance.

The production pattern: categorize tool calls by predictability. For calls that will almost certainly happen at call start (fetching account information, loading customer context), fire them at call initiation before the first user response. For calls triggered mid-conversation, use concurrency: acknowledge the request ("Let me pull that up") while the API call runs in parallel, generating filler response audio to cover the wait. If a call exceeds a latency threshold, return a bridging statement rather than silence.

The mental model: treat external calls as latency to be masked, not as synchronous operations that block the pipeline.
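The masking pattern can be sketched with `asyncio` (the CRM call, its timing, and the `speak` stand-in are all hypothetical):

```python
import asyncio

async def fetch_crm_record(customer_id: str) -> dict:
    await asyncio.sleep(0.4)                 # simulated slow external API
    return {"id": customer_id, "plan": "pro"}

async def speak(text: str) -> None:
    print(text)                              # stand-in for the TTS queue

async def handle_lookup(customer_id: str) -> dict:
    # Fire the API call first, then mask its latency with an acknowledgement.
    lookup = asyncio.create_task(fetch_crm_record(customer_id))
    await speak("Let me pull that up.")      # plays while the call runs
    try:
        # shield() keeps the task alive even if this wait times out.
        return await asyncio.wait_for(asyncio.shield(lookup), timeout=1.5)
    except asyncio.TimeoutError:
        await speak("One moment, still checking.")  # bridge, never silence
        return await lookup

record = asyncio.run(handle_lookup("cust-42"))
```

The acknowledgement plays concurrently with the API call, so a 400ms lookup costs the user nothing perceptible; only calls that outlast the bridging audio need a second filler.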

Measuring What Actually Matters

The standard metrics for each component — WER for STT, TTFT for LLM, time-to-first-byte for TTS — don't capture what users experience. End-to-end latency testing requires simulating actual call conditions: audio input through the real audio pipeline, full orchestration stack active, measuring from end-of-user-speech to start-of-agent-audio.

Per-component benchmarks are useful for debugging, not for judging production quality. The interactions between components — how quickly streaming tokens flow from LLM to TTS, whether the turn detection model creates unnecessary buffering, how barge-in interrupts the TTS queue — are where the real latency lives, and those only show up in end-to-end tests.

The metrics worth tracking in production:

  • P50 and P99 end-to-end latency (user speech end → first audio byte)
  • False interruption rate (how often the agent cuts off the user prematurely)
  • Barge-in handling success rate (how often user interruptions are correctly handled)
  • Turn detection latency (how long after semantic completion before agent responds)

P99 matters as much as P50 for voice — users remember the outlier experiences more than the average.
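The P50/P99 gap is easy to see with a nearest-rank percentile over a synthetic sample set (the latency values are invented):

```python
import math

def percentile(samples: list, p: float) -> float:
    # Nearest-rank percentile over end-to-end latency samples (ms).
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [210, 240, 250, 260, 270, 280, 290, 310, 330, 900]
p50 = percentile(latencies_ms, 50)   # 270 -- looks healthy
p99 = percentile(latencies_ms, 99)   # 900 -- one outlier dominates
```

A dashboard that only plots the median would call this system healthy while one in ten turns feels broken.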

The Stack Decision

There are three architectural approaches to building voice AI, with different tradeoffs:

Cascading pipeline (batch): STT → wait → LLM → wait → TTS. Total latency: 800-2000ms. Useful for prototyping. Not acceptable for production.

Streaming pipeline (custom): Each stage streams to the next; full control over every component. Total latency: 200-500ms achievable. Requires significant orchestration engineering, but gives you the most control over latency, model selection, and failure handling.

All-in-one API (e.g., GPT-4o Realtime, Gemini Live): Single API handles the STT→LLM→TTS loop internally. Total latency: 500-1200ms. Faster to ship, less flexibility, limited control over individual stage optimization.

The right choice depends on your quality and control requirements. All-in-one APIs are appropriate when speed to market matters more than latency optimization or model selection flexibility. Custom streaming pipelines are appropriate when you need to hit sub-400ms targets, integrate domain-specific TTS voices, or use specialized LLMs.

The strategic observation: as all-in-one APIs improve, the gap narrows. For most teams in 2026, the question isn't whether to build custom vs. use an API — it's whether your latency requirements or customization needs are extreme enough to justify the engineering investment.
