2 posts tagged with "voice-agents"

Your Voice Agent Trusts Every Transcription Error as Fact

· 10 min read
Tian Pan
Software Engineer

A user calls your insurance voice agent and asks about their deductible. The speech recognizer hears "the duck tibble." Your language model receives the string "the duck tibble," finds nothing coherent to do with it, and either asks a confused follow-up question or — worse — confabulates an answer about a product that does not exist. The user hangs up. Your logs show a successful turn: audio in, transcript produced, response generated, no error thrown.

That is the quiet failure at the heart of nearly every voice agent in production. The speech-to-text system did its job — it produced its single best guess. The language model did its job — it reasoned over the text it was handed. The bug lives in the gap between them, in a handoff that takes a probabilistic guess and relabels it as a fact.
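One way to picture a handoff that keeps the guess labeled as a guess is sketched below. This is a minimal illustration, not a real recognizer API: the `Hypothesis` type, the `handoff` function, the confidence values, and the 0.6 threshold are all hypothetical, and real speech-to-text systems expose confidence scores and alternative transcripts in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate transcript from the recognizer (hypothetical shape)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]; real scales vary


def handoff(hypotheses: list[Hypothesis], min_confidence: float = 0.6) -> dict:
    """Package recognizer output for the language model without hiding doubt."""
    if not hypotheses:
        raise ValueError("recognizer returned no hypotheses")
    # Rank candidates so the best guess comes first, but keep the rest.
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]
    return {
        "transcript": best.text,
        "confidence": best.confidence,
        "alternatives": [h.text for h in ranked[1:]],
        # Flag low-confidence turns so the agent can confirm before acting.
        "needs_confirmation": best.confidence < min_confidence,
    }


# Illustrative scores only: the misheard phrase narrowly outranks the right one.
turn = handoff([
    Hypothesis("the duck tibble", 0.41),
    Hypothesis("the deductible", 0.38),
])
```

Here the turn is flagged for confirmation and "the deductible" survives as an alternative, so the language model has something coherent to fall back on instead of a single string presented as fact.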

Voice Agent Turn-Taking: The 250ms Threshold That Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

Linguists who study turn-taking across languages keep arriving at the same number: the gap between speakers in casual conversation is roughly 200 to 300 milliseconds. Anything longer reads as hesitation, distance, or deference; anything shorter reads as interruption. That window is so tight that humans demonstrably begin formulating their reply before the other person finishes — listening and planning happen in parallel, not in sequence.

Voice agents that miss this window do not feel slightly slow. They feel wrong. A 700ms gap that nobody would notice in a chat product makes a voice agent seem dim or distracted, and invites the user to interrupt out of impatience. At 1.5 seconds, the user is already repeating themselves. Hitting the budget is not a polish task: it forces architectural choices that text agents never have to face, and those choices reshape how the whole stack is built.
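The budget arithmetic can be sketched in a few lines. The stage names and per-stage latencies below are hypothetical numbers chosen for illustration, not measurements from any real system; the only figure taken from the text is the roughly 300ms upper bound on a natural conversational gap.

```python
BUDGET_MS = 300  # upper end of the 200 to 300 ms conversational gap

# Hypothetical per-stage latencies in milliseconds, for illustration only.
STAGES_MS = {
    "end_of_speech_detection": 200,
    "speech_to_text": 150,
    "language_model_first_token": 250,
    "text_to_speech_first_audio": 100,
}


def sequential_gap(stages: dict[str, int]) -> int:
    """Response gap when each stage waits for the previous one to finish."""
    return sum(stages.values())


def over_budget(gap_ms: int, budget_ms: int = BUDGET_MS) -> int:
    """Milliseconds by which a gap exceeds the budget (0 if within it)."""
    return max(0, gap_ms - budget_ms)


gap = sequential_gap(STAGES_MS)
print(f"sequential gap: {gap} ms, over budget by {over_budget(gap)} ms")
```

Under these assumed numbers, a strictly sequential pipeline overshoots the budget even though no single stage exceeds it, which is why the stages have to overlap rather than run one after another.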