Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax
A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.
The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.
The latency budget is not what your dashboard thinks
When teams migrate a chat product to voice, the first instinct is to reuse the latency SLO. "We promised p95 under 1 second, we'll do that for voice too." This is already too slow, and it's too slow in a way the dashboard cannot see.
The number that matters in conversation is not the end-to-end latency from request to response. It's the perceived gap between when the user stops speaking and when the agent starts speaking. Conversation analysts have measured the average inter-turn gap across languages and it clusters around 200 milliseconds — shorter in some cultures, longer in others, but never anywhere near a full second. Push past 300ms and users start to consciously register the pause. Push past 800ms and they assume the call dropped.
Here is what makes the budget cruel. The 200ms target is not from "send the audio to the server" to "audio starts playing from the speaker." It's from "user's final phoneme hits the microphone" to "first phoneme of the agent's response hits the earpiece." Out of that you have to pay for voice-activity detection, end-of-utterance decision, audio transport, speech-to-text finalization, LLM first-token latency, text-to-speech first-chunk latency, and output audio transport. Every component wants its share. If your STT waits 400ms of silence before finalizing, you have already lost before the LLM is even invoked.
The other failure mode is counterintuitive: too fast is also wrong. Modern end-to-end speech models can respond in 150ms, which sounds great on a benchmark and feels robotic on a call. Human conversation has a cognitive shape — a reasonable pause before a thoughtful answer — and collapsing that shape flattens the agent into something uncanny. The target is not "as fast as possible." The target is "indistinguishable from an attentive human with a low-latency phone connection."
Turn detection is a model problem, not a silence problem
The default architecture for voice agents uses voice-activity detection (VAD) to decide when the user has stopped talking. VAD looks at the audio signal and says "yes there's voice" or "no there isn't." When the "no" persists past a configurable threshold — typically 500 to 800 milliseconds — the system decides the turn has ended and triggers the response.
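In sketch form, the default looks something like the loop below: a minimal endpointing loop, assuming a hypothetical frame-level `vad_is_speech` classifier and 20ms audio frames, where the only signal is a running silence counter.

```python
# Minimal VAD-only endpointing (sketch). `audio_frames` is a stream of 20 ms
# frames; `vad_is_speech` is a hypothetical frame-level voice classifier.
FRAME_MS = 20
SILENCE_THRESHOLD_MS = 700              # typical 500-800 ms setting

def detect_end_of_turn(audio_frames, vad_is_speech):
    """Collect frames until silence persists past the configured threshold."""
    utterance, silence_ms = [], 0
    for frame in audio_frames:
        if vad_is_speech(frame):
            utterance.append(frame)
            silence_ms = 0              # any voiced frame resets the timer
        else:
            silence_ms += FRAME_MS
            if utterance and silence_ms >= SILENCE_THRESHOLD_MS:
                break                   # declare the turn over, trigger the response
    return utterance
```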
This works beautifully in benchmarks where users speak in complete, uninterrupted sentences. It fails on real calls for three reasons.
First, people pause mid-thought. They stop talking for 600 milliseconds while they reach for a word, and a VAD-only system has already cut them off and sent a truncated utterance to the LLM. The LLM responds confidently to a sentence fragment. The user is mid-sentence when the agent starts talking, and now both parties are in an overlap.
Second, people backchannel. "Uh-huh," "right," "okay" — these are not turn-grabs, they're continuers that signal attention. A naive VAD treats them as new user turns and interrupts the agent's ongoing utterance. The agent stops mid-sentence, the user realizes they've been misheard, and the conversation falls apart.
Third, environmental noise — a cough, a door closing, a passing siren — looks identical to VAD. On a noisy phone line, VAD-only turn detection produces an agent that constantly gets distracted.
The fix is semantic end-of-turn detection. Instead of just measuring silence, a small transformer-based classifier looks at the partial transcript and predicts whether the sentence is syntactically and semantically complete. Production systems now use compact models in the 8M–135M parameter range, running on every transcription update, outputting a probability that the user is actually done. When the probability is low, the system extends the silence timeout. When it's high, it can finalize earlier than the silence threshold would suggest.
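The shape of that decision is easy to sketch. The snippet below assumes a hypothetical `eot_probability` classifier over the partial transcript; the probability cutoffs and timeouts are illustrative, not tuned production values.

```python
# Sketch: let a semantic end-of-turn probability stretch or shrink the VAD
# silence timeout instead of using one fixed value.
BASE_SILENCE_MS = 700    # fallback VAD timeout
FAST_SILENCE_MS = 150    # finalize quickly when the sentence looks complete
MAX_SILENCE_MS = 2000    # tolerate long pauses when the user is mid-thought

def silence_budget_ms(partial_transcript: str, eot_probability) -> int:
    """How much silence to tolerate before declaring this turn over."""
    p_done = eot_probability(partial_transcript)   # hypothetical classifier
    if p_done > 0.9:
        return FAST_SILENCE_MS    # "What's my balance?" -> finalize early
    if p_done < 0.3:
        return MAX_SILENCE_MS     # "I'd like to, um..." -> wait them out
    return BASE_SILENCE_MS
```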
This is the kind of architectural seam that does not show up in offline evals. You cannot measure it with BLEU or WER or task-success rate. You measure it with false-interrupt rate, false-early-finalization rate, and user restart rate. The teams that treat turn detection as a first-class modeled component — not a silence threshold to tune — are the ones whose agents sound like they're listening.
Barge-in is not an optional feature
A voice agent that cannot be interrupted feels like an IVR tree with a personality. The moment the user realizes they're trapped in a monologue, they start pressing zero, yelling "agent," or hanging up.
Making barge-in work is architecturally invasive. It's not a flag you flip. When the agent is speaking and the user starts speaking, the system has to decide in under 100ms whether this is a real interruption or a backchannel. If it's a real interruption, it has to stop TTS playback immediately, cancel the in-flight LLM generation, discard the partial response, re-transcribe the user's new input, and generate a response that acknowledges the interruption without repeating the cut-off text.
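That decision point can be sketched as a single handler, assuming an asyncio-style orchestrator with hypothetical in-flight `tts_task` and `llm_task` handles. The backchannel check here is a naive word list; a production system would use a classifier.

```python
import asyncio

# Naive continuer list, a stand-in for a proper backchannel classifier.
BACKCHANNELS = {"uh-huh", "mm-hmm", "right", "okay", "yeah"}

async def on_user_speech_during_playback(partial_transcript: str,
                                         tts_task: asyncio.Task,
                                         llm_task: asyncio.Task) -> bool:
    """Return True if the agent's turn was cut short by a real interruption.
    `tts_task` and `llm_task` are hypothetical in-flight pipeline tasks."""
    if partial_transcript.strip().lower() in BACKCHANNELS:
        return False                  # continuer: keep talking
    # Real barge-in: silence the agent first, then abandon the generation so
    # nothing further gets committed to speech or to tool calls.
    tts_task.cancel()
    llm_task.cancel()
    await asyncio.gather(tts_task, llm_task, return_exceptions=True)
    return True
```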
This has a cascade of consequences that teams discover only after launch.
TTS systems optimized for throughput will buffer entire paragraphs of audio ahead of playback for smooth delivery. If you want snappy barge-in, the buffer becomes your enemy — the user interrupts and the agent keeps talking for another 800ms because the audio is already in flight. You have to design for small chunks, low-latency streaming, and aggressive cancellation, which fights every throughput optimization in the stack.
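One way to keep the buffer from working against you is to make the playback loop itself the cancellation point, as in the sketch below. The `audio_chunks` iterator, `play_chunk` sink, and `interrupted` event are hypothetical stand-ins for the real streaming TTS output, audio device, and barge-in signal.

```python
import asyncio

async def play_response(audio_chunks, play_chunk, interrupted: asyncio.Event):
    """Play synthesized audio in small (~40 ms) chunks so barge-in lands fast.
    Checking the barge-in signal between chunks bounds the overrun to roughly
    one chunk plus output transport, instead of a whole buffered paragraph."""
    async for chunk in audio_chunks:
        if interrupted.is_set():
            break                     # stop within about one chunk
        await play_chunk(chunk)
```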
LLM partial-output commitment becomes a correctness problem. If the agent has already said the first sentence of its response out loud and the user interrupts, the second sentence — which the model hasn't finished generating yet — should be discarded. But if the agent has already committed to performing a tool call based on the full response plan, that tool call might still fire unless your orchestration explicitly binds tool execution to output that the user actually heard. More than one production incident has come from an agent that confirmed "I'll cancel your reservation" in voice, got interrupted before it finished the confirmation, and then ran the cancellation anyway because the LLM's response had already been parsed.
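One defensive pattern is to bind side effects to playback progress rather than to the parsed response, roughly as sketched below. The `SpokenCommitment` class and the character-level progress reporting are assumptions for illustration, not an API any particular framework provides.

```python
from dataclasses import dataclass

@dataclass
class SpokenCommitment:
    """Gate side effects on speech the user actually heard (sketch). A
    hypothetical TTS layer reports playback progress as the number of
    response characters whose audio has already left the speaker."""
    response_text: str
    chars_played: int = 0

    def mark_played(self, chars: int) -> None:
        self.chars_played = min(len(self.response_text),
                                self.chars_played + chars)

    def may_execute(self, confirmation: str) -> bool:
        """Allow a tool call only if its confirming sentence finished playing."""
        idx = self.response_text.find(confirmation)
        return idx != -1 and self.chars_played >= idx + len(confirmation)
```

On barge-in, the orchestrator checks whether the sentence announcing the action actually finished playing before it lets the tool call fire.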
Echo cancellation becomes a correctness issue too. If the agent's own voice bleeds into the input microphone on a phone call, naive VAD treats it as user speech and the agent starts interrupting itself. Duplex processing with good echo suppression is the difference between "voice agent" and "voice agent-shaped artifact."
The 200ms budget, allocated
It helps to walk through where the time goes on a realistic call. Here's a breakdown that treats the budget as something to be defended, not a goal to be missed gracefully.
- End-of-utterance decision: 50–150ms. A semantic turn-detection model runs on every ASR partial. When its confidence crosses the decision threshold, it short-circuits the VAD silence timer. With this in place, you can often finalize within 100ms of the user's last phoneme rather than waiting out the full VAD threshold.
- Audio transport: 30–80ms. Network round-trip plus buffering. On cellular this can swing wildly. WebRTC with opus at low bitrate is the realistic floor.
- ASR finalization: 50–150ms. Streaming ASR delivers partial hypotheses throughout the turn. Finalization is the last rescoring pass. Models that defer everything to a final pass will blow this budget.
- LLM first-token latency: 200–500ms. This is where the budget is almost always lost. Reasoning models are especially hostile here — their first token can arrive seconds in, with the "thinking" happening invisibly before it. Voice agents that need reasoning should do the reasoning against cached context before the turn ends, or use a cheaper, faster model on the critical path with a reasoning tier in the background.
- TTS first-chunk latency: 100–300ms. Streaming TTS that starts synthesizing on the first text token gives you a much lower time-to-first-audio than buffered TTS. Voice cloning and expressive TTS typically cost more here.
- Output transport: 30–80ms. Same as input.
Add the stages up sequentially and even the optimistic ends of those ranges total roughly 460ms, with the pessimistic ends past 1.2 seconds. The only way to claim the psychologically important 300ms zone is to overlap these stages aggressively — starting TTS on the first LLM token, predicting the end of turn before silence hits the threshold, and compressing LLM latency with aggressive context caching or speculative decoding.
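The overlap is easiest to see at the LLM-to-TTS seam, sketched below with hypothetical `llm_token_stream`, `synthesize_clause`, and `enqueue_audio` stand-ins: synthesis starts on the first completed clause instead of on the full response.

```python
async def respond(llm_token_stream, synthesize_clause, enqueue_audio):
    """Overlap generation and synthesis (sketch): hand each completed clause
    to TTS instead of waiting for the full response. All three callables are
    hypothetical stand-ins for the real streaming stages."""
    buffer = ""
    async for token in llm_token_stream:
        buffer += token
        # Flush on clause boundaries so the first audio trails the first LLM
        # tokens by one clause, not by the whole response.
        if buffer.rstrip().endswith((".", "?", "!", ",")):
            await enqueue_audio(await synthesize_clause(buffer))
            buffer = ""
    if buffer.strip():
        await enqueue_audio(await synthesize_clause(buffer))
```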
Teams that try to hit this budget with a cascaded pipeline assembled from generic components will fail. The shift to integrated voice-native stacks — end-to-end speech models like the ones underpinning real-time APIs, or tightly engineered hybrid systems — is not hype. It's the only way to claw back enough milliseconds to sound natural.
What this means for your roadmap
The practical implication for engineering teams is that a voice product cannot be "a chat product with the transport layer swapped." It needs a different skill set, a different eval methodology, and a different failure-mode culture.
Evals have to include timing. Task success rate at the transcript level is not enough — you need metrics for false-interrupt rate, agent-interruption-tolerated rate (did the barge-in work when the user wanted to interrupt?), time-to-first-audio from end-of-utterance, and turn-overlap rate. These are conversation-shape metrics, not content metrics, and most LLM eval frameworks do not emit them by default. You have to instrument the voice stack itself.
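What that instrumentation boils down to can be sketched with a minimal per-turn trace, assuming a hypothetical schema of speaker plus start and end timestamps. The two metrics computed here are time-to-first-audio and how often the agent starts talking over the user.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Turn:
    """One turn from an audio-level trace (hypothetical schema);
    timestamps are milliseconds from the start of the call."""
    speaker: str      # "user" or "agent"
    start_ms: int
    end_ms: int

def conversation_shape_metrics(turns: list[Turn]) -> dict:
    """Timing metrics a transcript-level eval never sees (sketch)."""
    gaps, agent_turns, agent_overlaps = [], 0, 0
    for prev, cur in zip(turns, turns[1:]):
        if cur.speaker != "agent":
            continue
        agent_turns += 1
        if prev.speaker == "user":
            gaps.append(cur.start_ms - prev.end_ms)    # time-to-first-audio
            if cur.start_ms < prev.end_ms:
                agent_overlaps += 1                    # agent talked over the user
    return {
        "median_time_to_first_audio_ms": median(gaps) if gaps else None,
        "false_interrupt_rate": agent_overlaps / agent_turns if agent_turns else 0.0,
    }
```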
Incident classification has to account for the new failure modes. An "agent said the wrong thing" ticket is the easy case. The harder tickets are "agent kept interrupting me," "agent took a full second to respond," "agent finished my sentence for me." These are timing bugs, not prompt bugs, and diagnosing them requires audio-level traces with phoneme-accurate timestamps.
Hiring has to shift. The people who are great at prompt engineering and tool design are not always the people who can debug a VAD false-positive rate on cellular audio. Voice teams that succeed have someone who has suffered through WebRTC at some point in their career, and someone who thinks about echo cancellation before it becomes a P0.
The model is the easy part. The conversation protocol is the hard part. Voice agents that feel natural are voice agents whose teams treated turn negotiation as a first-class engineering problem and spent the months necessary to get the milliseconds back. The ones that feel uncanny are the ones that shipped a chat product into a phone call and hoped the model would paper over the timing.
