
Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down

11 min read
Tian Pan
Software Engineer

The first time you ship a voice agent, you'll get two complaints: "It interrupted me," and "It feels rude." Both are the same bug. The agent isn't impolite — it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.

Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion, anything faster reads as interruption, and a voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.

The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.

The Time-to-First-Token Lie

Every chat-AI team has internalized time-to-first-token (TTFT) as the latency metric that matters. In a chat product, a 600ms TTFT feels snappy because the user's eyes are still on the input box and the first word arriving anywhere in the next second registers as "fast."

Voice breaks this completely. The user has stopped speaking. There is no input box to look at. There is silence, and silence in a voice channel is a load-bearing signal. Past about 300ms of dead air, the user starts to wonder if the system heard them; past about 1.5 seconds, they assume something is broken and either repeat themselves or hang up. The same TTFT that delights chat users tanks voice satisfaction because the affordance is missing — there is no spinner, no shimmer, no typing dot to occupy the gap.

The trap is that the chat dashboard will look fine the entire time the voice product is failing. The model is responding within SLA. The 95th-percentile TTFT is healthy. And the support team is fielding calls about the "weird, robotic agent that keeps cutting people off." TTFT was never the right metric for voice; it just happens to be the one your inference platform exposes.

Decompose the Budget Before You Optimize It

The single useful change a voice team can make is to stop tracking end-to-end latency as one number and start tracking the four-part budget that actually makes up the user-perceived experience:

  • VAD detection (50–80ms): how long after the user stops talking before the system commits a turn-end.
  • ASR partials and finalization (150–200ms): how long until the transcript is stable enough to ship to the model.
  • Model TTFT (300–500ms with a streaming pipeline; 250–500ms with a native-audio model): how long until the first response token arrives.
  • TTS first-byte audio (100–150ms): how long from the first model token to the first audible syllable.

These add up. A target of 600 to 800 milliseconds end-to-end is the conversational ceiling — past that, the agent feels slow even if every individual stage is "fast." If you don't decompose the budget, you'll spend three sprints optimizing whichever stage your tracing happens to highlight, which is rarely the worst offender. The team that owns the VAD has no incentive to learn that the bottleneck is TTS first-byte; the team that owns the model is convinced the problem is the network.
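
To make that concrete, here is a minimal sketch of what "writing the budget down" can look like in code. The stage names and targets are taken from the ranges above and are purely illustrative; nothing here is tied to a particular SDK or tracing system.

```python
from dataclasses import dataclass, field

# Illustrative per-stage targets (all in ms), taken from the ranges above;
# the stage names are assumptions, not any particular SDK's vocabulary.
STAGE_TARGETS_MS = {
    "vad_commit": 80,        # user stops talking -> turn-end committed
    "asr_final": 200,        # turn-end -> stable transcript
    "model_ttft": 500,       # transcript -> first response token
    "tts_first_byte": 150,   # first token -> first audible syllable
}
E2E_CEILING_MS = 800


@dataclass
class TurnTiming:
    """Per-stage latencies for one agent turn, reported against the written budget."""
    stages_ms: dict = field(default_factory=dict)

    def record(self, stage: str, elapsed_ms: float) -> None:
        self.stages_ms[stage] = elapsed_ms

    def report(self) -> None:
        for stage, target in STAGE_TARGETS_MS.items():
            actual = self.stages_ms.get(stage, 0.0)
            flag = "  <-- over budget" if actual > target else ""
            print(f"{stage:15} {actual:6.0f} ms (target {target} ms){flag}")
        total = sum(self.stages_ms.values())
        flag = "  <-- over ceiling" if total > E2E_CEILING_MS else ""
        print(f"{'end-to-end':15} {total:6.0f} ms (ceiling {E2E_CEILING_MS} ms){flag}")
```

Even a crude per-turn report like this turns "the agent feels slow" into "TTS first-byte is double its target," which is the conversation you actually want to have with the team that owns that stage.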

The decomposition also exposes the asymmetric optimizations that actually move the needle. Pre-warming TTS while the model is still streaming. Speculatively starting ASR on the first frame of voiced audio rather than after VAD commits. Streaming partial transcripts to the model so its context is warm before the user's turn ends. None of these are visible if your dashboard reports a single end-to-end number.
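
A rough sketch of that overlap pattern is below, with stub stages standing in for real ASR, model, and TTS clients. Every function name and latency in it is made up to show the shape of the parallelism, not a real API.

```python
import asyncio
import time

# Stub stages with illustrative latencies so the overlap is visible when run;
# every name and number here is hypothetical -- swap in real ASR/LLM/TTS clients.

async def prewarm_tts():
    await asyncio.sleep(0.10)              # open the TTS session while the user is still talking
    return "tts-session"

async def stream_asr_partials():
    for partial in ["book a", "book a table", "book a table for two"]:
        await asyncio.sleep(0.15)          # partial transcripts arrive while the user speaks
        yield partial

async def model_first_token(transcript: str) -> str:
    await asyncio.sleep(0.35)              # model TTFT on the committed transcript
    return f"Sure, {transcript}. Checking availability now."

async def handle_turn():
    t0 = time.monotonic()
    tts_task = asyncio.create_task(prewarm_tts())   # overlap: TTS warms during the user's turn

    transcript = ""
    async for partial in stream_asr_partials():     # overlap: ASR streams before the turn-end commits
        transcript = partial

    first_token = await model_first_token(transcript)
    await tts_task                                   # already warm by now; no extra wait added here
    elapsed_ms = (time.monotonic() - t0) * 1000
    print(f"first audio could start at ~{elapsed_ms:.0f} ms: {first_token!r}")

asyncio.run(handle_turn())
```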

Half-Duplex Pipelines Are Lying About Conversation

The most common architectural mistake in voice AI is also the easiest to miss: a half-duplex pipeline where the microphone is muted while the agent is speaking. It's tempting because it eliminates a real engineering problem (the agent hearing its own TTS through the mic), but it makes barge-in structurally impossible. The user cannot interrupt because the system literally cannot hear them.

Real conversation is full-duplex. Both parties can produce audio at any moment, and turn negotiation happens through overlap, not through tidy alternation. A voice agent that mutes the mic during TTS is not having a conversation; it's running a script with predictable interruption points the user has to wait for. Users learn this quickly — they stop trying to interject, they start over-explaining because they can't course-correct mid-response, and the conversations get longer and more frustrating.

Going full-duplex requires three things the half-duplex design lets you skip:

  1. Acoustic echo cancellation so the mic stream doesn't pick up the speaker output and trigger spurious VAD events.
  2. Concurrent audio streams with the mic and TTS routed independently rather than serialized through a single audio session.
  3. Explicit interruption handling at the model and tool layer, because the user can now barge in mid-sentence and you have to decide what that means for the in-flight response.

The first two are platform-level concerns and largely solved by mature voice SDKs. The third is where the architecture decisions live, and where most teams discover they don't have a graceful preemption story.
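
Sketched below is one way the second and third requirements hang together: the mic loop and the TTS playback loop run as independent tasks, and voiced audio during playback raises a barge-in signal. The frame, VAD, and speaker objects are hypothetical stand-ins, not any specific SDK's types.

```python
import asyncio

# Full-duplex routing sketch: the echo-cancelled mic loop and the TTS playback
# loop run concurrently; voiced audio during playback sets a barge-in event.
# mic_frames, vad, tts_chunks, and speaker are hypothetical stand-ins.

async def mic_loop(mic_frames, vad, barge_in: asyncio.Event):
    async for frame in mic_frames:         # mic stays open while the agent speaks
        if vad.is_voiced(frame):
            barge_in.set()                 # the user started talking over us

async def tts_loop(tts_chunks, speaker, barge_in: asyncio.Event):
    async for chunk in tts_chunks:         # 100-200 ms chunks, per the contract below
        if barge_in.is_set():
            break                          # stop at a chunk boundary, not mid-syllable
        await speaker.play(chunk)

async def speak_with_barge_in(mic_frames, vad, tts_chunks, speaker) -> bool:
    barge_in = asyncio.Event()
    mic_task = asyncio.create_task(mic_loop(mic_frames, vad, barge_in))
    try:
        await tts_loop(tts_chunks, speaker, barge_in)
    finally:
        mic_task.cancel()                  # scoped to this sketch; a real agent keeps the mic loop alive
    return barge_in.is_set()               # True -> the caller now has to preempt the in-flight response
```

The return value is the important part: full-duplex routing only earns its keep if the caller does something graceful with the barge-in, which is the subject of the next section.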

Graceful Preemption Is a State-Management Problem

When a user barges in, three things are usually in flight: TTS audio playback, model token generation, and possibly a tool call. Cancelling the audio is the easy part. Cancelling the model and the tool call without corrupting state is where systems fall over.

The minimum viable preemption contract has four parts:

  1. Cut the TTS audio at a chunk boundary, not mid-syllable. If your TTS produces 100–200ms chunks, you can stop cleanly within one frame; if it produces a single long buffer, you'll either truncate awkwardly or wait too long to acknowledge the user.
  2. Truncate the model output to what the user actually heard, not what the model generated. The token the user heard last is the conversational truth. Sending the full generated response back to context produces a transcript where the agent claims to have said things the user never heard.
  3. Decide per tool whether to cancel or complete silently. A read-only lookup is fine to cancel. A payment authorization is not — once it's been issued you have to reconcile the result, which means the next turn has to acknowledge it even though the user interrupted before hearing it.
  4. Reset the response generation state without resetting the conversation state. The new user utterance is not a new session; it's a continuation that happens to have preempted the previous response. The model's memory of the truncated reply needs to be accurate or the next turn will repeat or contradict it.

Skip any of these and the failure mode is the same: the agent's view of what it said diverges from the user's view of what they heard. From the user's seat, the agent looks confused. From the logs, everything is fine.
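
One possible shape for that contract in code is below, assuming the pipeline tracks how much of the response was actually played (the `heard_up_to` offset) and keeps a per-tool barge-in policy. The tool names and the conversation representation are illustrative; step 1, cutting TTS at a chunk boundary, lives in the playback loop rather than here.

```python
from dataclasses import dataclass
from typing import Optional

# One possible shape for the preemption contract. The tool policy table, the
# heard_up_to offset, and the conversation representation are illustrative.

TOOL_ON_BARGE_IN = {
    "lookup_order_status": "cancel",       # read-only: safe to drop
    "authorize_payment": "complete",       # side-effecting: finish, reconcile next turn
}

@dataclass
class InFlightResponse:
    generated_text: str                    # everything the model produced
    heard_up_to: int                       # offset of the last word actually played to the user
    pending_tool: Optional[str] = None

def preempt(response: InFlightResponse, conversation: list) -> list:
    """Apply the contract on barge-in; returns tool calls that must still complete."""
    # (2) The transcript records what the user heard, not what the model generated.
    heard = response.generated_text[: response.heard_up_to]
    conversation.append({"role": "assistant", "content": heard, "interrupted": True})

    # (3) Per-tool policy: cancel read-only calls, let side-effecting ones finish.
    must_complete = []
    if response.pending_tool and TOOL_ON_BARGE_IN.get(response.pending_tool) == "complete":
        must_complete.append(response.pending_tool)

    # (4) Response-generation state is discarded; conversation state is kept as-is.
    return must_complete
```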

Silence Is Not Always Turn-End

The classic VAD design treats a fixed silence threshold (say, 700ms of no voiced audio) as a turn-end signal. This is the right baseline and the wrong default. Humans pause in the middle of utterances all the time — searching for a word, taking a breath, thinking through a number. A 700ms threshold catches all of these as turn-ends and the agent jumps in to finish the user's sentence.

Modern turn detection layers semantic and prosodic signals on top of VAD to make the silence-equals-turn-end assumption smarter:

  • Semantic VAD classifies the partial transcript and asks "does this look like a complete thought?" A trailing "ummm" or a sentence ending in "and" gets a longer grace period; a definitive "...and that's all" gets a shorter one.
  • Prosodic models look at intonation contour, pitch drop, and final-syllable lengthening to predict whether the user is winding down or just pausing. These are the cues humans actually use, and the cues your VAD-only pipeline is throwing away.
  • Voice activity projection uses transformer models trained on conversational data to predict the probability of an upcoming turn-end window, letting the system pre-stage a response before the silence even arrives.

The right architecture is hybrid. VAD remains the cheap, fast baseline. Semantic and prosodic models stretch or shrink the grace period depending on what the user actually said and how they said it. The metric to track is not VAD accuracy in isolation; it's false-cut rate (the agent interrupted a user mid-thought) versus dead-air rate (the agent waited too long after a clear turn-end). Optimizing one in isolation always wrecks the other.
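
A minimal sketch of that hybrid policy: VAD measures the silence, and semantic and prosodic signals stretch or shrink the grace period before the turn-end commits. The suffix heuristics and thresholds below are placeholders for real classifiers.

```python
# Hybrid turn-detection sketch: a VAD-style silence measurement is compared
# against a grace period that semantic and prosodic signals adjust per turn.
# The string heuristics and probability thresholds stand in for real models.

BASE_GRACE_MS = 700.0

def turn_end_grace_ms(partial_transcript: str, prosodic_turn_end_prob: float) -> float:
    grace = BASE_GRACE_MS
    text = partial_transcript.rstrip().lower()

    # Semantic VAD: incomplete-looking thoughts get a longer grace period.
    if text.endswith(("and", "but", "um", "uh", ",")):
        grace *= 2.0                       # "...and" -> keep waiting
    elif text.endswith((".", "?", "that's all")):
        grace *= 0.5                       # clearly finished -> commit sooner

    # Prosody: a confident turn-end prediction (pitch drop, final lengthening) shortens the wait.
    if prosodic_turn_end_prob > 0.8:
        grace *= 0.6
    elif prosodic_turn_end_prob < 0.3:
        grace *= 1.5

    return grace

def should_commit_turn(silence_ms: float, partial_transcript: str, prosodic_prob: float) -> bool:
    return silence_ms >= turn_end_grace_ms(partial_transcript, prosodic_prob)
```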

Native-Audio Models Don't Eliminate the Budget — They Reshape It

The 2026 generation of speech-to-speech models — gpt-realtime, gemini-live, and friends — collapses the ASR-LLM-TTS pipeline into a single model that processes audio in and produces audio out. The marketing pitch is "250–500ms end-to-end latency" and it's mostly true. The trap is the assumption that a native-audio model removes the need for a turn-taking budget.

It doesn't. It reshapes the budget. With a cascading pipeline, the latency is dominated by serial stages and you optimize by parallelizing them. With a native-audio model, the dominant cost is the model's own first-audio time and your VAD/turn-detection layer that still has to decide when to commit a turn. The end-to-end number is smaller, but the human side of the conversation hasn't changed — 200ms is still the target, 300ms still feels slow, 1500ms still loses the user.

What a native-audio model does change is the failure surface. Prosody preservation gets better because the model isn't passing through a text bottleneck. Interruption handling gets harder because the model's audio output is more entangled with its internal state — you can't just truncate a TTS buffer cleanly. Tool calls get more awkward because the model is producing audio while the tool result hasn't returned yet. Teams that switch to a native-audio model and don't redesign their preemption and tool-orchestration layer end up with a faster product that fails in different ways than the cascading one did.

What to Write Down Before You Ship

The disciplined version of this work isn't exotic. It's a one-page document that names:

  • The four-part latency budget with target numbers per stage and an end-to-end ceiling.
  • The full-duplex audio routing diagram with explicit acoustic-echo and barge-in handling.
  • The preemption contract for TTS, model output, and each tool the agent can call.
  • The turn-detection policy: VAD as baseline, semantic and prosodic overrides, and the false-cut versus dead-air metric the team optimizes against.
  • The native-vs-cascading decision and the specific reasons that pipeline was chosen for this product.

Without this document, the team will ship a chat agent over an audio API and will be surprised when users describe the product as "weird" rather than complaining about any specific feature. With it, the work has a shape: the bottleneck is identifiable, the trade-offs are negotiable, and the eventual user complaint maps to a numbered budget line you can tune.

The architectural realization that has to land is small but load-bearing. Voice AI is not text AI with microphones strapped to it. It's a real-time system whose latency budget is set by the rhythm of human conversation, not by the SLA of your inference platform. The team that internalizes this builds an agent that feels like a conversation. The team that doesn't ships a clever script with rude pauses, and learns about the gap from the support queue.
