Voice Agent Turn-Taking: The 250ms Threshold That Reshapes Your Architecture
Linguists who study turn-taking across languages keep arriving at the same number: the gap between speakers in casual conversation is roughly 200 to 300 milliseconds. Anything longer reads as hesitation, distance, or deference; anything shorter reads as interruption. That window is so tight that humans demonstrably begin formulating their reply before the other person finishes — listening and planning happen in parallel, not in sequence.
Voice agents that miss this window do not feel slightly slow. They feel wrong. A 700ms gap that nobody notices in a chat product makes the agent seem dim or distracted, and invites the user to interrupt out of impatience. At a 1.5-second gap, the user is already repeating themselves. Hitting the budget is not a polish task: it forces architectural choices that text agents never have to face, and those choices reshape how the whole stack is built.
The problem is not "make the model faster." The model is one of four serial systems on the critical path, and the model is rarely the largest contributor when you actually decompose the wall-clock. Voice stacks that hit a sub-second target end up looking less like a chat API and more like a real-time telephony system, with all the disciplines that implies: streaming everywhere, predictive inference, hard deadlines per stage, and a control plane that can cancel work the moment circumstances change.
Decomposing the budget
A useful target for natural-feeling voice interaction is 800ms end-to-end perceived latency, with a stretch goal under 500ms for the snappiest interactions. The natural turn-transition window is 200–300ms, but in practice anything below 800ms is acceptable to most users for task-oriented dialog; the 200–300ms target only becomes critical in free-flowing conversation where the user and agent trade rapid, overlapping turns.
Inside that 800ms, four stages compete:
- End-of-turn detection: deciding the user has finished speaking. Naive VAD with a 500ms silence trigger eats 500ms by itself, before any inference begins. Modern semantic endpointing brings this down to 150–250ms.
- STT finalization: locking the partial transcript into a stable form the LLM can act on. Streaming ASR systems like AssemblyAI's Universal-3 Pro report ~150ms p50 from end-of-speech to final transcript.
- LLM time-to-first-token (TTFT): the model emitting its first generated token. This is the metric that matters; the rest of the response can stream in parallel with the TTS playback. Targets sit at 200–300ms with prompt caching, longer without.
- TTS time-to-first-audio: synthesizing the first phoneme so the speaker actually starts moving. Streaming TTS systems hit 75–200ms; non-streaming systems wait for the full sentence and add 500–1500ms.
Add network transport on both ends (30–50ms on a typical wired or Wi-Fi connection, more on cellular) and you can see how the budget evaporates. Naive sequential execution adds these up: 500 + 150 + 300 + 150 + 80 = 1180ms before the user hears a single syllable. That is not a voice agent; that is a walkie-talkie.
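That arithmetic is worth making concrete. Here is a quick sketch using the mid-range stage figures above (illustrative numbers, not measurements), which also previews the streamed bound the next section derives:

```python
# Back-of-the-envelope budget for a naive sequential pipeline.
stages_ms = {
    "endpointing": 500,    # naive VAD silence trigger
    "stt_finalize": 150,   # end-of-speech to final transcript
    "llm_ttft": 300,       # time-to-first-token, with prompt caching
    "tts_ttfa": 150,       # time-to-first-audio, streaming synthesis
    "transport": 80,       # ~40ms each way
}

sequential_ms = sum(stages_ms.values())
print(f"sequential: {sequential_ms}ms")      # 1180ms -- far over an 800ms budget

# Idealized fully-streamed bound: the slowest stage dominates instead of
# the sum. Real overlap is never this perfect, but it shows the headroom.
pipelined_ms = max(stages_ms.values())
print(f"pipelined bound: {pipelined_ms}ms")  # 500ms -- endpointing dominates
```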
Streaming changes the equation from sum to max
The architectural escape from that 1180ms is to stop running stages sequentially and start running them in parallel through streaming pipelines. STT emits partial transcripts as the user speaks. The LLM begins generating before the user finishes. TTS begins synthesizing before the LLM finishes. Each stage operates on a sliding window of the previous stage's output rather than waiting for completion.
The latency math shifts from VAD + STT + LLM + TTS to something closer to max(VAD, STT, LLM, TTS) — bounded by whichever stage takes longest, not by their sum. In a well-designed pipeline, the moment end-of-turn is detected, the LLM is already mid-generation against a high-confidence partial transcript and the TTS first-byte is hundreds of milliseconds away rather than seconds.
This pushes architectural pressure into places that text systems do not have to think about. Every stage needs a streaming protocol. Every stage needs to handle revision, because the partial it just received may change. Every stage needs an explicit cancellation primitive, because half the work it starts will be thrown away when the user interrupts or the upstream stage emits a different output.
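A minimal sketch of that shape, assuming hypothetical streaming stage callables (`stt`, `llm`, `tts`, and `play` are stand-ins, not any particular SDK), with asyncio cancellation as the teardown primitive:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

# Each stage maps a stream of partials to a stream of partials. These
# signatures are illustrative; real providers expose similar streaming APIs.
Stage = Callable[[AsyncIterator], AsyncIterator]

def start_turn(
    frames: AsyncIterator[bytes],
    stt: Stage,    # audio frames -> partial transcripts
    llm: Stage,    # transcripts  -> tokens
    tts: Stage,    # tokens       -> audio chunks
    play: Callable[[bytes], Awaitable[None]],
) -> asyncio.Task:
    """Chain streaming stages so the turn's latency is bounded by the
    slowest stage, and return a Task whose cancel() tears down every
    stage's in-flight work at once (the barge-in primitive)."""
    async def _run() -> None:
        async for chunk in tts(llm(stt(frames))):
            await play(chunk)  # first audio long before the LLM finishes
    return asyncio.create_task(_run())

# Usage on interruption or upstream revision:
#   turn = start_turn(mic_frames, stt, llm, tts, speaker.play)
#   ...
#   turn.cancel()   # CancelledError propagates into every stage
```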
Endpointing is the load-bearing decision
The single biggest lever in turn-taking quality is how the system decides the user has stopped speaking. Three approaches dominate, with sharply different latency-quality tradeoffs.
Pure VAD-on-silence measures audio energy and triggers end-of-turn after a fixed silence duration — typically 500–800ms. It is the simplest implementation and the worst experience: the agent waits for a full pause every time the user takes a breath, mistakes hesitation for completion when the user is mid-thought, and produces a cadence that feels like talking to a satellite phone.
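For reference, the naive baseline fits in a few lines, which is exactly why it is so common (frame size and energy floor below are illustrative):

```python
import array
import math

FRAME_MS = 20             # typical telephony frame size
SILENCE_TRIGGER_MS = 500  # the fixed hangover this approach pays every turn
ENERGY_FLOOR = 300.0      # RMS threshold; tune per microphone and codec

def rms(pcm16: bytes) -> float:
    samples = array.array("h", pcm16)  # 16-bit signed PCM
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

class SilenceEndpointer:
    """Naive VAD-on-silence: fire end-of-turn after a fixed run of quiet
    frames. It cannot tell a breath from completion, so every single turn
    pays the full SILENCE_TRIGGER_MS before inference can even begin."""
    def __init__(self) -> None:
        self.silent_ms = 0

    def push_frame(self, pcm16: bytes) -> bool:
        self.silent_ms = 0 if rms(pcm16) >= ENERGY_FLOOR else self.silent_ms + FRAME_MS
        return self.silent_ms >= SILENCE_TRIGGER_MS  # True = end-of-turn
```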
Endpointing models trained on conversational speech improve on this by combining acoustic and lexical cues — pitch contour, rate-of-speech changes, lexical completeness — to predict end-of-turn faster than fixed silence thresholds. These typically sit in the 200–400ms range and are domain-tunable.
Semantic turn detection, the current state of the art, runs a small classifier over the partial transcript to decide whether the utterance is syntactically and semantically complete. It is the only approach that handles the spelling-out case correctly: when a user says "my number is two-five-five" and pauses, a VAD-based system fires end-of-turn after 500ms; a semantic model recognizes the utterance as incomplete and waits. Open-source projects like Pipecat's smart-turn and managed services like LiveKit's adaptive endpointing report 86%+ precision with sub-300ms latency under realistic conditions.
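The core mechanism is easy to sketch: make the silence threshold a function of semantic completeness rather than a constant. `completeness_model` below is a hypothetical classifier, not smart-turn's or LiveKit's actual API, and the thresholds are illustrative:

```python
def should_end_turn(partial: str, silence_ms: int, completeness_model) -> bool:
    """Dynamic endpointing: confident-complete utterances commit fast,
    confident-incomplete ones get patience, ambiguity falls back to VAD."""
    p_complete = completeness_model.score(partial)  # hypothetical, 0.0..1.0
    if p_complete > 0.9:
        return silence_ms >= 150    # "what's my balance?" -> commit quickly
    if p_complete < 0.3:
        return silence_ms >= 2000   # "my number is two-five-five" -> wait
    return silence_ms >= 600        # ambiguous: roughly VAD behavior
```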
The architectural implication is that turn detection is no longer a single-signal classifier you slot in front of STT. It is a fusion problem combining acoustic VAD (fast, noisy), lexical endpointing (slow, precise), and semantic completion classification (slow, very precise) — and the orchestration of those signals is a meaningful piece of the stack.
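One way that fusion can look in practice, assuming the three signals arrive on different clocks and may be stale (all thresholds, again, illustrative):

```python
import time
from dataclasses import dataclass

@dataclass
class FusedTurnDetector:
    """Fuses acoustic, lexical, and semantic endpointing signals. The
    acoustic signal updates every frame; the other two lag behind, so
    each carries the timestamp of its last update."""
    vad_silence_ms: int = 0
    lexical_done: bool = False
    lexical_at: float = 0.0
    semantic_p: float = 0.5
    semantic_at: float = 0.0

    def decide(self, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        fresh = lambda t: now - t < 0.3  # discard signals older than 300ms
        if fresh(self.semantic_at) and fresh(self.lexical_at):
            # Fast path: both slow signals are fresh and agree -> fire early.
            if self.semantic_p > 0.9 and self.lexical_done:
                return self.vad_silence_ms >= 150
            if self.semantic_p < 0.3:
                return self.vad_silence_ms >= 2000  # incomplete: hold
        # Slow path: signals missing or stale -> degrade to conservative VAD.
        return self.vad_silence_ms >= 600
```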
Speculative generation: starting before the user finishes
Once you accept that endpointing has irreducible latency — even an excellent semantic detector adds 150–300ms — the next move is to do useful work during that window. Speculative LLM generation kicks off inference on the partial transcript when STT confidence crosses a threshold (typically 80–90%), and either commits or discards the result based on what the final transcript actually says.
When the speculation is right, the user perceives an instant response: the LLM had a 300ms head start, so by the time end-of-turn fires, the model is already generating tokens. When the speculation is wrong — the user said something different from what the partial implied — the work is discarded and the system pays a re-prompt cost of 300–500ms. Industry numbers suggest this nets out positive when speculation is correct 80%+ of the time, which is achievable in domains with predictable phrasings (customer service, scheduling, ordering) and harder in open-ended conversation.
The architectural cost is that the LLM call layer needs to be cancellable, idempotent, and instrumented for hit-rate. Without those, speculative generation is a 50/50 bet that wastes budget half the time. With them, it is the largest single improvement available short of using a smaller model.
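A sketch of that call layer, assuming a cancellable streaming generation function (`start_generation` is a stand-in for whichever provider call you use):

```python
import asyncio

class SpeculativeLLM:
    """Start inference on a high-confidence partial transcript, then
    commit or discard when the final transcript arrives. Tracks hit
    rate, because without that number speculation is an untested bet."""
    def __init__(self, start_generation, confidence_threshold: float = 0.85):
        self.start = start_generation  # hypothetical cancellable call
        self.threshold = confidence_threshold
        self.task: asyncio.Task | None = None
        self.speculated_on: str | None = None
        self.hits = 0
        self.misses = 0

    def on_partial(self, text: str, confidence: float) -> None:
        if self.task is None and confidence >= self.threshold:
            self.speculated_on = text
            self.task = asyncio.create_task(self.start(text))  # head start

    async def on_final(self, final: str):
        task, basis = self.task, self.speculated_on
        self.task, self.speculated_on = None, None
        if task is not None and basis == final:  # real systems compare normalized text
            self.hits += 1
            return await task       # commit: the head start is banked
        if task is not None:
            self.misses += 1
            task.cancel()           # discard: pay the re-prompt cost
        return await self.start(final)
```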
Barge-in: tearing down what is already in flight

Naturalistic conversation is full of overlap. The user starts speaking before the agent finishes. The user emits a "mhmm" while the agent talks, which is a backchannel signal of attention and not an interruption. Sometimes the user begins clarifying mid-sentence and genuinely needs the agent to stop. Treating all of these the same is how voice agents end up either steamrolling users or stopping mid-word every time anyone clears their throat.
Barge-in handling is the part of the stack that distinguishes these cases and reacts to real interruptions in under 200ms. Concretely:
- Echo cancellation so the agent's own audio playback does not register as user speech in the input stream. This is non-trivial when the agent's voice is running through a phone line with delay and codec artifacts.
- Backchannel detection so short, low-energy utterances ("yeah", "uh-huh", "okay") that function as conversational support do not trigger an interruption. This usually combines duration thresholds with a small classifier.
- Immediate TTS halt the moment a real interruption is confirmed, including dropping any audio buffered downstream of the synthesis stage.
- In-flight LLM cancellation so the model stops generating tokens the user will never hear, which matters for both billing and for the next turn's coherence.
- Heard-so-far reconstruction, telling the LLM in the next turn exactly how much of its previous response actually played to the user. Without this, the agent will assume the user heard the full response and reference content the user never received.
That last point is the one teams underestimate. The audio cancellation is mechanical; the conversational state recovery is where the agent feels intelligent or feels broken. An agent that says "as I mentioned, the total is forty dollars" when the user interrupted before the price was spoken has lost the conversational thread, regardless of how fast the audio cut off.
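A minimal sketch of that recovery, assuming the playback layer can report how far into the response audio actually played (many TTS APIs expose word- or character-level timing that makes this alignment possible):

```python
def reconcile_interrupted_turn(history: list[dict], chars_played: int) -> None:
    """Rewrite the last assistant message to contain only what the user
    actually heard, so the next LLM call cannot reference unspoken content.
    `chars_played` comes from aligning TTS timing data to the source text."""
    full = history[-1]["content"]
    heard = full[:chars_played].rstrip()
    if heard:
        history[-1]["content"] = heard + " [response cut off by user]"
    else:
        history[-1]["content"] = "[interrupted before any audio played]"

# After barge-in, the agent's context now reads e.g.
#   "the total for your order is [response cut off by user]"
# so "as I mentioned, the total is forty dollars" can no longer happen.
```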
Voice stacks are real-time systems, not chat APIs
The deepest architectural shift is the mental model. Chat APIs are request-response with soft latency targets — slow is annoying but rarely unusable. Voice stacks are real-time systems with hard deadlines: miss the budget and the interaction is qualitatively broken, not quantitatively slower.
That shift pulls in disciplines from telephony and streaming media that the LLM ecosystem has spent two years happily ignoring:
- Tail latency matters more than median. A p50 of 600ms with a p99 of 4 seconds will produce a usable demo and an unusable product. Voice users are exquisitely sensitive to outliers because every outlier feels like the agent froze. The instrumentation needs p99 SLOs per stage, not just averages.
- Jitter is its own metric. Even if the median is fast, high variance reads as the agent being erratic. Sub-100ms standard deviation per stage is a reasonable target. Wide variance often comes from cold caches, GC pauses, or upstream provider variability — none of which show up in median dashboards.
- Cancellation is a first-class primitive. Every stage needs to be interruptible at the millisecond level. Building on a model API that does not stream or cancel cleanly will cap the experience regardless of model quality.
- Geographic colocation matters. A 200ms transatlantic round trip on the audio path eats most of the turn-taking budget. Voice products that ship internationally either deploy regional inference or accept a regional quality difference that is more pronounced than text products experience.
Teams that come from a chat-LLM background tend to underweight all of these, optimize median latency on a single happy-path scenario, and then discover the product feels broken under real-world conditions of cellular networks, background noise, and tail variance from third-party providers.
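The corresponding instrumentation is small; what matters is that it exists per stage and alarms on the tail, not the median (budget values illustrative):

```python
import statistics

def stage_report(samples_ms: list[float], p99_budget_ms: float) -> dict:
    """Per-stage latency report: a stage is healthy only if the median,
    the tail, and the jitter are all inside budget."""
    xs = sorted(samples_ms)
    p50 = xs[len(xs) // 2]
    p99 = xs[min(int(len(xs) * 0.99), len(xs) - 1)]
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "jitter_ms": statistics.pstdev(xs),  # target: under ~100ms
        "p99_ok": p99 <= p99_budget_ms,
    }

# A stage with a fast median can still sink the product on its tail:
# stage_report(llm_ttft_samples_ms, p99_budget_ms=600)
```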
What this means for product strategy
There is a leadership-level realization buried in the engineering details: a voice product's perceived intelligence is bounded by its turn-taking reflexes long before its reasoning capability matters. A merely competent model with sub-500ms turn-taking feels smarter than a frontier model with 1.5-second pauses, because the lower-latency system maintains conversational flow and the higher-latency system breaks it on every turn.
This inverts the usual product strategy of pursuing model quality first. For voice, the right play is to lock down the real-time pipeline — endpointing, streaming, cancellation, barge-in — on a smaller model that meets the latency target, and only then explore whether a larger model can fit within the same budget. Teams that try to use the latest frontier model from day one tend to ship a slow voice agent and spend a quarter trying to make it faster, when the better path is to ship a fast voice agent on a mid-tier model and upgrade to a frontier model once provider TTFT improves.
The 250ms turn-taking threshold is not a polish target you reach in the last sprint. It is the constraint that determines which model you can use, which providers you can integrate with, which deployment topology you need, and which engineering disciplines your team has to learn. Teams that treat voice as "chat plus audio I/O" find this out slowly and painfully. Teams that treat voice as a real-time system from day one find that the 250ms target — and the architectural choices that follow from it — is what actually distinguishes a product from a demo.
- https://cresta.com/blog/engineering-for-real-time-voice-agent-latency
- https://hamming.ai/resources/voice-ai-latency-whats-fast-whats-slow-how-to-fix-it
- https://www.assemblyai.com/blog/low-latency-voice-ai
- https://www.twilio.com/en-us/blog/developers/best-practices/guide-core-latency-ai-voice-agents
- https://livekit.com/blog/turn-detection-voice-agents-vad-endpointing-model-based-detection
- https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent
- https://picovoice.ai/blog/complete-guide-voice-activity-detection-vad/
- https://github.com/pipecat-ai/smart-turn
- https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture
- https://livekit.com/blog/sequential-pipeline-architecture-voice-agents
- https://orga-ai.com/blog/blog-barge-in-voice-agents-guide
- https://sparkco.ai/blog/optimizing-voice-agent-barge-in-detection-for-2025
- https://www.assemblyai.com/blog/voice-agent-architecture
- https://www.channel.tel/blog/voice-ai-pipeline-stt-tts-latency-budget
- https://smallest.ai/blog/designing-voice-assistants-stt-llm-tts-tools-and-latency-budget
- https://deepgram.com/learn/streaming-tts-latency-accuracy-tradeoff
- https://introl.com/blog/voice-ai-infrastructure-real-time-speech-agents-asr-tts-guide-2025
- https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained
