Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax
A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.
The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.
