Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down
The first time you ship a voice agent, you'll get the same complaint twice: "It interrupted me," and "It feels rude." Both are the same bug. The agent isn't impolite — it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.
Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion, anything faster reads as interruption, and a voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.
The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.
