3 posts tagged with "voice-ai"

Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down

· 11 min read
Tian Pan
Software Engineer

The first time you ship a voice agent, you'll get the same complaint twice: "It interrupted me," and "It feels rude." Both are the same bug. The agent isn't impolite — it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.

Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion; anything faster reads as interruption. A voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.

The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.
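What "writing the budget down" might look like is worth making concrete. The sketch below is illustrative only: the stage names and millisecond figures are assumptions for the example, not measurements from any real pipeline, and the 300ms ceiling is the human inter-speaker gap discussed above.

```python
from dataclasses import dataclass

@dataclass
class TurnBudget:
    """Per-turn latency budget in milliseconds.

    Stage names and numbers are illustrative assumptions,
    not measurements from a production system."""
    endpoint_detect: int = 100   # VAD decides the user has stopped talking
    asr_finalize: int = 80       # speech-to-text emits its final hypothesis
    llm_first_token: int = 250   # model produces the first response token
    tts_first_audio: int = 120   # TTS synthesizes the first audio chunk
    network: int = 50            # round trips between components

    def total(self) -> int:
        return (self.endpoint_detect + self.asr_finalize
                + self.llm_first_token + self.tts_first_audio + self.network)

    def within_human_window(self, ceiling_ms: int = 300) -> bool:
        # The ~100-300ms inter-speaker gap humans calibrate against.
        return self.total() <= ceiling_ms

budget = TurnBudget()
print(budget.total())                # 600
print(budget.within_human_window())  # False
```

The point of the exercise isn't the specific numbers; it's that once the budget exists as data, every stage has an owner and overruns show up as arithmetic, not vibes.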

Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax

· 10 min read
Tian Pan
Software Engineer

A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.

The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.
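The claim that turn negotiation is architectural can be made concrete with a minimal floor-management state machine. Everything here is a sketch under assumptions: the states, event names, and the 200ms endpoint threshold are illustrative, not taken from any particular framework.

```python
from enum import Enum, auto

class Floor(Enum):
    USER_SPEAKING = auto()
    AGENT_SPEAKING = auto()
    GAP = auto()  # neither side holds the floor

class TurnManager:
    """Minimal half-duplex floor manager (illustrative sketch)."""

    def __init__(self) -> None:
        self.state = Floor.GAP

    def on_user_voice(self) -> str:
        # Barge-in: user speech while the agent holds the floor must
        # cut playback immediately -- this cannot be done in a prompt.
        if self.state is Floor.AGENT_SPEAKING:
            self.state = Floor.USER_SPEAKING
            return "barge-in: stop TTS playback"
        self.state = Floor.USER_SPEAKING
        return "listen"

    def on_user_silence(self, silence_ms: int, endpoint_ms: int = 200) -> str:
        # Only take the floor after enough silence to count as an endpoint;
        # shorter gaps are treated as mid-sentence pauses.
        if self.state is Floor.USER_SPEAKING and silence_ms >= endpoint_ms:
            self.state = Floor.AGENT_SPEAKING
            return "take the floor: start responding"
        return "hold: likely a mid-sentence pause"

mgr = TurnManager()
mgr.state = Floor.AGENT_SPEAKING
print(mgr.on_user_voice())        # barge-in: stop TTS playback
print(mgr.on_user_silence(250))   # take the floor: start responding
```

Notice that the transitions live in the audio pipeline, below the model: the LLM never sees the barge-in event until after playback has already stopped, which is why no amount of prompting can fix a missing one.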

Voice AI in Production: Engineering the 300ms Latency Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.
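Where that 600ms-to-1.5s pause comes from is easiest to see as arithmetic. The decomposition below is a hypothetical naive pipeline with assumed per-stage numbers, not profiling data from the post; it shows only why running stages strictly in sequence blows through a 300ms target.

```python
# Hypothetical fully-sequential pipeline: each stage waits for the
# previous one to finish completely before starting.
naive_pipeline_ms = {
    "endpoint_detect": 500,   # fixed silence timeout before acting
    "asr_finalize": 150,      # wait for the full final transcript
    "llm_first_token": 400,   # wait for the model to start generating
    "tts_first_audio": 200,   # wait for synthesis of the first chunk
}

felt_pause = sum(naive_pipeline_ms.values())
print(felt_pause)  # 1250 -- squarely in the "robotic" 600ms-1.5s range
```

Sequential stage times add; the target only becomes reachable when stages stream and overlap, so the felt pause collapses toward the longest serial dependency rather than the sum.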