
Your Voice Agent Trusts Every Transcription Error as Fact

10 min read
Tian Pan
Software Engineer

A user calls your insurance voice agent and asks about their deductible. The speech recognizer hears "the duck tibble." Your language model receives the string "the duck tibble," finds nothing coherent to do with it, and either asks a confused follow-up question or — worse — confabulates an answer about a product that does not exist. The user hangs up. Your logs show a successful turn: audio in, transcript produced, response generated, no error thrown.

That is the quiet failure at the heart of nearly every voice agent in production. The speech-to-text system did its job — it produced its single best guess. The language model did its job — it reasoned over the text it was handed. The bug lives in the gap between them, in a handoff that takes a probabilistic guess and relabels it as a fact.

Text agents do not have this problem in the same way. When a user types "the duck tibble" into a chat box, they can see the typo and fix it before hitting enter. A voice agent's user gets no such chance. They said the word correctly. The machine misheard. And by the time the misrecognition reaches the model, every trace of doubt has been thrown away.

The handoff that throws away the error bars

Every modern speech recognizer is, internally, an uncertainty machine. It does not decide on words; it scores hypotheses. At any moment it is tracking dozens of candidate transcriptions (an n-best list, or more richly a lattice), each with a probability attached. "The duck tibble" might score 0.62. "Deductible" might score 0.31. A handful of other candidates split the rest.

Then the API call returns, and almost all of that structure is discarded. The recognizer collapses its entire probability distribution down to the single highest-scoring path — the 1-best hypothesis — and hands your application a plain string. The 0.62 is gone. The runner-up is gone. The fact that this was a close call is gone.
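The collapse is easy to see in a few lines. The sketch below is illustrative only: `Hypothesis`, `one_best`, and `top_margin` are hypothetical names rather than any vendor's API, and the scores are invented so that the mishearing narrowly wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate transcription and the score the recognizer gave it."""
    text: str
    score: float


def one_best(nbest: list[Hypothesis]) -> str:
    """Collapse the ranked list to a bare string, dropping every score."""
    return max(nbest, key=lambda h: h.score).text


def top_margin(nbest: list[Hypothesis]) -> float:
    """Gap between the best and second-best score; small means a close call."""
    ranked = sorted(nbest, key=lambda h: h.score, reverse=True)
    if len(ranked) < 2:
        return ranked[0].score
    return ranked[0].score - ranked[1].score


# Invented scores (an assumption for illustration, not real recognizer output).
nbest = [
    Hypothesis("what is the duck tibble", 0.52),
    Hypothesis("what is my deductible", 0.41),
    Hypothesis("what is my detectable", 0.07),
]

transcript = one_best(nbest)  # the plain string the application usually sees
margin = top_margin(nbest)    # the close-call signal that gets thrown away
```

Nothing about `transcript` reveals that the runner-up was only about 0.11 behind. That margin exists only as long as the application keeps the list.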

Your language model now receives "the duck tibble" with exactly the same epistemic status as a query typed by a careful user on a mechanical keyboard. There is no field in the prompt that says "low confidence." There is no marker that says "this span was contested." The model has no way to know it is reasoning over a guess, so it reasons over it as truth. Uncertainty was destroyed at precisely the boundary that most needed to preserve it.

This is not a flaw in any one component. The recognizer is allowed to return its best guess; that is a reasonable default for the common case of dictation, where a human reads the output and corrects it. The flaw is architectural: a pipeline designed for human-in-the-loop transcription got repurposed as the front end of an autonomous agent, and nobody re-examined what the interface should carry.

Why word error rate lies to you

The instinct, when you hear about transcription errors, is to reach for a better recognizer with a lower word error rate. WER is the industry-standard metric: the percentage of words substituted, inserted, or deleted relative to a reference transcript. Professional systems aim for 5 to 10 percent. That sounds reassuring until you look at which words the errors land on.

WER treats every word equally. Misrecognizing "the" as "a" counts the same as misrecognizing "$1,500" as "$15,000" or "allergic" as "a little." But those errors are not equal. One is cosmetic; the others change the meaning, the database query, or the medical decision. A transcript can post an excellent 6 percent WER and still be wrong about the one word that determines the outcome of the call.
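A small worked example makes the blind spot concrete. The function below is a minimal sketch of WER as word-level edit distance divided by reference length; the name `wer` and the sample sentences are made up for illustration, and production tooling usually normalizes text more carefully than a lowercase-and-split.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "please set my deductible to fifteen hundred dollars starting in march"
cosmetic = "please set the deductible to fifteen hundred dollars starting in march"
critical = "please set my deductible to fifteen thousand dollars starting in march"

cosmetic_wer = wer(reference, cosmetic)  # "my" -> "the"
critical_wer = wer(reference, critical)  # "hundred" -> "thousand"
```

Both hypotheses contain one substitution in eleven words, so they score identically. The metric cannot tell the harmless error from the one that changes the amount tenfold.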

Practitioners measuring voice agents in 2025 found this gap large enough to need its own metric. Action Error Rate — the rate at which the agent takes the wrong action — routinely runs 10 to 30 percentage points higher than the raw WER of the same transcripts. Semantically critical errors dominate downstream failures even when overall transcription looks clean. The recognizer's average accuracy is simply the wrong number to optimize, because the agent does not act on the average. It acts on specific tokens: amounts, dates, names, account numbers, negations.
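The text above does not give a formula for Action Error Rate, so the following is one plausible reading, stated as an assumption: the fraction of turns in which the action the agent took differs from the action a correct transcript would have led to. The `Turn` record and the action labels are hypothetical, and the four turns are toy data, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One evaluated turn: what the agent should have done and what it did."""
    expected_action: str
    actual_action: str


def action_error_rate(turns: list[Turn]) -> float:
    """Share of turns where the agent's action differs from the expected one."""
    if not turns:
        raise ValueError("need at least one evaluated turn")
    wrong = sum(t.expected_action != t.actual_action for t in turns)
    return wrong / len(turns)


# Toy data for illustration only.
turns = [
    Turn("lookup_deductible", "lookup_deductible"),
    Turn("set_payment:1500", "set_payment:15000"),  # one misheard word
    Turn("cancel_policy", "cancel_policy"),
    Turn("update_address", "update_address"),
]

aer = action_error_rate(turns)
```

Measuring this requires labeling the expected action per turn, which costs more than scoring transcripts, but it scores the thing the agent is actually judged on.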

Negation is the cruelest case. "It is now" versus "it is no." "I can make that payment" versus "I can't make that payment." A single dropped or misheard syllable inverts the entire intent, and it inverts it silently, with high recognizer confidence, because both phrasings are perfectly fluent English. No fluency check downstream will catch it. The transcript reads like something a person would say, because it is — just not this person, on this call.

The escape hatch your user does not have

It is worth dwelling on the asymmetry with text, because it explains why this bug is so easy to miss in a demo and so damaging in production.

In a text interface, the input the user sees and the input the model receives are the same artifact. The user is the last line of error correction, and a good one — they reread, they catch the autocorrect that turned "deductible" into something absurd, they fix it. The interface gives them a window onto exactly what the system will process.

A voice interface breaks that loop. The user knows what they said. They have no idea what the machine heard. There is no rendered transcript scrolling past their eyes, no chance to interject "no, I said deductible." The misrecognition becomes part of the agent's reality, and the user finds out only when the agent's response makes no sense — by which point the conversation has already gone sideways and recovering it costs the user patience your product can rarely afford.

This means a voice agent cannot lean on the user to absorb recognition errors the way a text agent leans on them to absorb typos. The error-correction responsibility that text quietly outsources to the human has to be designed back into the system. If you do not own it explicitly, nobody owns it.

Passing the uncertainty downstream instead of flattening it

The fix starts at the interface. If the recognizer knows a span was contested, that knowledge should survive the handoff rather than being collapsed away.

The most direct version: stop consuming only the 1-best string. Most recognition APIs can return an n-best list — the top several candidate transcriptions — and per-word or per-span confidence scores. Research on exploiting ASR uncertainty has shown that prompting a language model with the n-best list, rather than just the top hypothesis, measurably improves downstream tasks like intent detection and device-directed speech detection. The model can see that "deductible" and "the duck tibble" were the two leading candidates, recognize that one is a coherent insurance term and one is noise, and reconcile accordingly. You are not asking the model to be a better recognizer. You are giving it the evidence it needs to do the disambiguation it is already good at.
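One way to put the n-best list in front of the model is simply to render it into the prompt. The sketch below assumes the recognizer's output has already been reduced to `(text, score)` pairs; `build_nbest_prompt` and the prompt wording are hypothetical, and how to request alternatives differs from one recognition API to another.

```python
def build_nbest_prompt(nbest: list[tuple[str, float]], domain: str) -> str:
    """Render ranked recognizer hypotheses into a prompt for the language model."""
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    lines = [
        f"The caller is speaking to a {domain} voice agent.",
        "A speech recognizer produced these candidate transcriptions,",
        "most likely first. Scores are recognizer confidences, not facts.",
        "",
    ]
    for rank, (text, score) in enumerate(ranked, start=1):
        lines.append(f'{rank}. "{text}" (score {score:.2f})')
    lines += [
        "",
        "Decide which candidate the caller most plausibly said in this domain.",
        "If the candidates disagree on a word that changes the request, say so.",
    ]
    return "\n".join(lines)


prompt = build_nbest_prompt(
    [("what is my deductible", 0.41), ("what is the duck tibble", 0.52)],
    domain="insurance",
)
```

The domain hint matters as much as the list. It is what lets the model prefer a coherent insurance term over a higher-scoring piece of noise.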

A few practical patterns follow from this:

  • Carry confidence into the prompt. Annotate low-confidence spans explicitly — even a crude marker like [uncertain: deductible|duck tibble] — so the model reasons over a guess as a guess, not as fact.
  • Reconcile rather than rescore blindly. When you have multiple hypotheses, evaluate where they agree, where they diverge, and whether the divergence touches a high-stakes token. Agreement on the boring words plus disagreement on the amount is a precise, actionable signal.
  • Log the distribution, not just the winner. Store n-best output, per-word confidence, timestamps, and model versions for every turn. You cannot debug a cascading failure from a single collapsed string, and watching the confidence distribution drift over time is an early warning that audio quality or the recognizer itself has changed.
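The first two patterns can be sketched together by diffing the top two hypotheses word by word. This is a minimal illustration built on the standard library's `difflib.SequenceMatcher`. The `HIGH_STAKES` set, the function names, and the marker format (borrowed from the example above) are assumptions, and a real system would more likely key on slot types than on a word list.

```python
import difflib

# Hypothetical tokens whose misrecognition would change the action taken.
HIGH_STAKES = {"hundred", "thousand", "can", "can't", "no", "not", "deductible"}


def _divergences(top: str, runner_up: str):
    """Yield (tag, top_words, runner_up_words) for each aligned span."""
    a, b = top.split(), runner_up.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, a[i1:i2], b[j1:j2]


def annotate(top: str, runner_up: str) -> str:
    """Keep agreed words as-is; wrap contested spans in an [uncertain: ...] marker."""
    parts = []
    for tag, a_words, b_words in _divergences(top, runner_up):
        if tag == "equal":
            parts.extend(a_words)
        else:
            left = " ".join(a_words) or "(nothing)"
            right = " ".join(b_words) or "(nothing)"
            parts.append(f"[uncertain: {left}|{right}]")
    return " ".join(parts)


def touches_high_stakes(top: str, runner_up: str) -> bool:
    """True if any contested span contains a token from HIGH_STAKES."""
    return any(
        tag != "equal" and (HIGH_STAKES & set(a_words + b_words))
        for tag, a_words, b_words in _divergences(top, runner_up)
    )


annotated = annotate(
    "what is the duck tibble on my plan",
    "what is the deductible on my plan",
)
```

Here `annotated` keeps the agreed words and isolates the single contested span, which is the part worth showing to the model or asking the caller about.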

None of this requires a unified speech-native model, though those help by never collapsing to text in the first place. It mostly requires treating the recognizer's output as what it actually is — a ranked set of hypotheses with probabilities — instead of the one thing it is most convenient to treat it as.

Designing the agent to ask before it acts

Preserving uncertainty is only useful if the agent does something with it. The second half of the fix is behavioral: the agent should ask for confirmation when confidence is low or the stakes are high, and only then.

The discipline here is to separate two axes that teams tend to conflate. One axis is recognizer confidence — how sure the system is about what it heard. The other is consequence — how much it costs to be wrong. You confirm when either is alarming, and you stay quiet when both are fine.

A low-confidence span on a throwaway word does not need a confirmation. Interrupting the user to re-confirm "the" is its own kind of failure: the agent that asks "sorry, can you repeat that?" four times until the caller gives up. But a low-confidence span on an account number always warrants one. And a high-stakes span (a payment amount, a medication, a cancellation, a booking date) warrants explicit confirmation even when the recognizer is confident, because the recognizer's confidence is exactly the thing that can be wrong. The conventional split holds up well: explicit confirmation for high-stakes actions, implicit confirmation for low-stakes ones, and a clarification question whenever a contested span and a consequential slot overlap.
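Keeping the two axes separate is easier when the policy is written down as a function of both. The sketch below is one possible reading of the rules in this section, not a standard. The three-level `Stakes` scale, the `Confirm` outcomes, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from enum import Enum


class Stakes(Enum):
    THROWAWAY = 0  # filler words, nothing acts on them
    LOW = 1        # a slot that is cheap to get wrong
    HIGH = 2       # payment amount, medication, cancellation, booking date


class Confirm(Enum):
    NONE = "none"
    IMPLICIT = "implicit"  # echo the value back while moving on
    EXPLICIT = "explicit"  # ask for a yes/no before acting
    CLARIFY = "clarify"    # ask a targeted question about the contested span


def choose_confirmation(
    confidence: float,
    stakes: Stakes,
    low_confidence_below: float = 0.7,  # assumed threshold; tune per recognizer
) -> Confirm:
    """Pick a confirmation behavior from recognizer confidence and consequence."""
    contested = confidence < low_confidence_below
    if stakes is Stakes.HIGH:
        # High stakes always get a second look, even at high confidence.
        return Confirm.CLARIFY if contested else Confirm.EXPLICIT
    if stakes is Stakes.LOW and contested:
        return Confirm.IMPLICIT
    # Confident low-stakes slots and throwaway words: stay quiet.
    return Confirm.NONE
```

For example, `choose_confirmation(0.95, Stakes.HIGH)` still returns `Confirm.EXPLICIT`, while `choose_confirmation(0.3, Stakes.THROWAWAY)` returns `Confirm.NONE`, so the agent never interrupts the caller over a filler word.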

Good confirmation is also specific. "I'm not sure I caught that, could you repeat the whole thing?" makes the user redo work and frustrates them. "I have your deductible question — was that fifteen hundred dollars, or fifteen thousand?" targets exactly the contested token, takes one second to answer, and shows the user the system is paying attention. For inputs like email addresses or confirmation numbers, breaking the slot into smaller pieces and confirming each — or asking the user to spell — costs a few seconds and removes an entire class of silent failure.
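Both ideas, the targeted question and the chunked read-back, are small enough to sketch. The helper names, the question wording, and the default chunk size of three are hypothetical choices for illustration.

```python
def targeted_question(slot: str, candidates: list[str]) -> str:
    """Ask only about the contested value instead of the whole utterance."""
    if len(candidates) < 2:
        raise ValueError("need at least two competing candidates")
    *head, last = candidates
    return f"I have your {slot} question. Was that {', '.join(head)}, or {last}?"


def chunk_slot(value: str, size: int = 3) -> list[str]:
    """Split a long identifier into short pieces to confirm one at a time."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [value[i:i + size] for i in range(0, len(value), size)]


question = targeted_question(
    "deductible", ["fifteen hundred dollars", "fifteen thousand"]
)
pieces = chunk_slot("A1B2C3D4")
```

Here `question` reproduces the targeted prompt from the paragraph above, and `pieces` breaks the made-up confirmation number into `A1B`, `2C3`, and `D4` so each piece can be read back and confirmed separately.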

Treat the transcript as evidence, not as the user's words

The mental shift worth carrying out of all this is small and load-bearing. A transcript is not what the user said. It is the recognizer's best estimate of what the user said, and that estimate comes with error bars whether or not your system bothers to look at them.

A voice agent that treats the transcript as ground truth is building reasoning on a foundation it has no way to audit. A voice agent that treats the transcript as evidence — ranked, scored, sometimes contested, occasionally just wrong — can do the things a careful human listener does: notice when a word does not fit, weigh a strange-sounding phrase against context, and ask before acting on something that matters.

Concretely, that means three things in your next voice project. Stop consuming only the 1-best string; take the n-best list and the confidence scores. Stop optimizing WER as if every word mattered equally; measure action error rate and instrument the tokens that actually drive decisions. And give the agent an explicit confirmation behavior keyed to confidence and stakes, so the moments that matter get a second look. The recognizer will keep making mistakes — that is its nature. Your job is to make sure those mistakes arrive labeled as mistakes, not laundered into facts.
