The Transcription Confidence Score Your Agent Trusted After the Vendor's Recalibration
The voice agent had a gate. Anything above 0.85 transcription confidence went straight to the planning step; anything below got routed to a human. The threshold had been tuned six months earlier against a labeled corpus of real customer calls, frozen into a config file, and forgotten. For six months it did exactly what it was supposed to do. Then the transcription provider shipped a model upgrade — same API, same response shape, same latency band, same documented accuracy — and over the next two weeks the agent started authorizing wire transfers to the wrong people.
"Transfer $50 to mom" became "transfer $5,000 to Tom." The new transcript came back with a confidence of 0.91, well above the gate. The downstream planner saw a confident transcript and acted on it. The customer's appeal eventually surfaced the bug, but by then the support queue had filtered out a week's worth of similar incidents as fraud disputes. The post-mortem traced the gap to a single decision the team had never made explicitly: that 0.85 from the old model and 0.85 from the new model were the same number.
They weren't. The provider's release note mentioned a "calibration update to the confidence head" in the third bullet of a six-bullet changelog. Word error rate on the provider's own benchmark was unchanged. The provider's confidence distribution on a fixed reference set, however, had shifted upward by about 15 points (0.15 on the 0–1 scale). Every transcript that used to score in the 0.70–0.85 band, the band the team had explicitly chosen to gate out, now scored 0.85–0.95 and walked through the gate unchallenged.
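A minimal sketch of that failure, under loud assumptions: the scores are invented, and `recalibrated` models the vendor's change as a flat upward shift, which a real retrained confidence head would not be. The only point is that the gate code is identical before and after.

```python
THRESHOLD = 0.85  # the frozen gate value from the config


def passes_gate(confidence: float, threshold: float = THRESHOLD) -> bool:
    """Return True when the transcript goes straight to the planner."""
    return confidence > threshold


def recalibrated(confidence: float, shift: float = 0.15) -> float:
    """Hypothetical stand-in for the new confidence head:
    a flat upward shift, clipped to 1.0."""
    return min(1.0, confidence + shift)


# Invented old-model scores for one fixed set of utterances.
old_scores = [0.62, 0.71, 0.78, 0.84, 0.88, 0.93]
new_scores = [recalibrated(s) for s in old_scores]

old_admitted = [s for s in old_scores if passes_gate(s)]
new_admitted = [s for s in new_scores if passes_gate(s)]

# Utterances the old model routed to a human that now reach the planner.
newly_admitted = [
    old for old, new in zip(old_scores, new_scores)
    if not passes_gate(old) and passes_gate(new)
]

print(f"admitted before: {len(old_admitted)}, after: {len(new_admitted)}")
print(f"old scores now passing the gate: {newly_admitted}")
```

In this toy data, every utterance that sat between 0.70 and 0.85 under the old model is admitted under the new one, with no change to the gate itself.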
A Confidence Score Is Not a Probability — It's a Calibration
The most common misread in voice-agent architecture is treating the confidence number a transcription API returns as a probability. It isn't. It's the output of a separate, model-specific head trained to correlate with correctness on the provider's evaluation set. A well-calibrated model assigning 80% confidence should be correct about 80% of the time. Most production ASR models are not well-calibrated, and the ones that are well-calibrated against one distribution are typically miscalibrated against another.
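One way to check whether a confidence number behaves like a probability on your own traffic is to bin labeled transcripts by confidence and compare each bin's mean confidence with its observed accuracy. The sketch below assumes a hypothetical list of `(confidence, correct)` pairs, where `correct` is whatever per-transcript correctness label the team already uses (for example, an exact match against a human reference). The function names are illustrative and not part of any provider's SDK.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]  # (confidence in [0, 1], transcript was correct)


def reliability_bins(samples: Sequence[Sample],
                     n_bins: int = 5) -> List[Tuple[int, float, float]]:
    """Group samples into equal-width confidence bins.

    Returns one (count, mean_confidence, accuracy) row per non-empty bin.
    """
    bins: List[List[Sample]] = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    rows = []
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(samples: Sequence[Sample],
                               n_bins: int = 5) -> float:
    """Bin-size-weighted average gap between confidence and accuracy."""
    total = len(samples)
    return sum(
        (count / total) * abs(mean_conf - accuracy)
        for count, mean_conf, accuracy in reliability_bins(samples, n_bins)
    )


# Invented data: ten transcripts at 0.9 confidence, only eight correct.
overconfident = [(0.9, True)] * 8 + [(0.9, False)] * 2
print(expected_calibration_error(overconfident))
```

A result near zero means the scores track accuracy on that dataset. It says nothing about a different dataset or a different model version.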
This matters because two providers, or two versions of the same provider's model, can produce identical raw transcripts with materially different confidence numbers. Deepgram and AssemblyAI both expose a floating-point confidence between 0 and 1, and Whisper reports per-segment average log-probabilities that are commonly converted into the same range, so the numbers look interchangeable. They aren't. A 0.85 from one model is not commensurable with a 0.85 from another, and the team that swaps providers behind the same gate has changed the gate's strictness without touching the code.
The deeper issue is that even within a single provider, the confidence head can be retrained independently of the acoustic model. Vendors do this routinely. They publish a new model checkpoint with the same name, same SLA, same documented WER, and a confidence head that was trained on different data with a different temperature. The accuracy number — the only metric most customers track — is unchanged. The calibration curve is different. Research on ASR confidence calibration has spent two decades documenting how brittle these numbers are under domain shift, but production teams rarely treat that brittleness as a contract concern.
The Threshold Is a Contract With a Specific Distribution
When a team writes confidence > 0.85 into a config, they aren't picking a confidence level — they're encoding the empirical distribution of confidence scores on the data they tuned against. That number has no semantics outside that distribution. If the model's confidence head is retrained and the distribution shifts, the same threshold now corresponds to a different decision boundary.
A concrete way to see this: the constant 0.85 in the code is doing two jobs at once. It's specifying a target precision (the fraction of admitted transcripts that should be correct) and it's specifying a target acceptance rate (the fraction of transcripts that should pass the gate). A single constant can pin both only while the confidence head's calibration is held fixed. When the calibration shifts upward, the acceptance rate goes up and the precision of what gets admitted goes down, while the team assumes the precision is unchanged. The gate gets looser by exactly the amount the team didn't measure.
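The two jobs can be measured separately on a labeled set. The sketch below uses invented data and a hypothetical `gate_stats` helper. The "new" scores are the old ones pushed up by a flat 0.10 and clipped at 1.0, a simplification of what a retrained confidence head would actually do.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, bool]  # (confidence, transcript was correct)


def gate_stats(samples: Sequence[Sample],
               threshold: float) -> Tuple[float, Optional[float]]:
    """Return (acceptance_rate, precision) for a `confidence > threshold` gate.

    Precision is None when nothing passes the gate.
    """
    admitted = [ok for conf, ok in samples if conf > threshold]
    acceptance_rate = len(admitted) / len(samples)
    precision = sum(admitted) / len(admitted) if admitted else None
    return acceptance_rate, precision


# Invented labeled corpus scored by the old model.
old_model = [
    (0.95, True), (0.92, True), (0.90, True), (0.88, True), (0.80, False),
    (0.78, True), (0.74, False), (0.72, False), (0.60, False), (0.55, False),
]

# Same utterances, same correctness labels, scores shifted up by 0.10.
new_model = [(min(1.0, conf + 0.10), ok) for conf, ok in old_model]

for name, data in (("old", old_model), ("new", new_model)):
    rate, precision = gate_stats(data, threshold=0.85)
    print(f"{name}: acceptance={rate:.2f} precision={precision:.2f}")
```

In this toy corpus the acceptance rate rises from 0.40 to 0.60 and precision falls from 1.00 to about 0.83, with the threshold untouched. Tracking both numbers separately shows the loosening that the constant alone hides.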
Calibration drift can come from several places, and most of them won't trigger a notification:
- A provider-side model upgrade that changes the confidence head's training regime.
- A change in audio capture quality on the team's own side — different microphones, different codec, different bitrate — that pushes the runtime distribution off the tuning distribution.
- A demographic or accent shift in the user base that moves the input distribution into a region the provider's calibration set under-represented.
- A change in the ambient noise profile of the calls (background noise at 55–65 dB can already reduce accuracy by 15–30% depending on the codec, and confidence often moves more than accuracy does under that pressure).
In every one of these scenarios, the team's gate logic does exactly what it was told to do, on numbers that have quietly changed meaning. The agent fails confidently.
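Because none of these shifts announces itself, one cheap signal is the gate's own acceptance rate, which needs no labels. The sketch below compares a recent window against the rate observed when the threshold was tuned. The function names, the sample scores, and the 0.05 tolerance are illustrative assumptions, not recommended values.

```python
from typing import Dict, Sequence


def acceptance_rate(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of transcripts that would pass a `confidence > threshold` gate."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    return sum(1 for c in confidences if c > threshold) / len(confidences)


def acceptance_drift(baseline: Sequence[float],
                     recent: Sequence[float],
                     threshold: float = 0.85,
                     tolerance: float = 0.05) -> Dict[str, object]:
    """Flag when the recent acceptance rate departs from the tuning-time rate."""
    base = acceptance_rate(baseline, threshold)
    now = acceptance_rate(recent, threshold)
    return {
        "baseline": base,
        "recent": now,
        "drifted": abs(now - base) > tolerance,
    }


# Invented scores: the tuning corpus versus a recent window of traffic.
baseline_scores = [0.95, 0.91, 0.88, 0.80, 0.76, 0.72, 0.66, 0.61, 0.58, 0.50]
recent_scores = [0.97, 0.95, 0.93, 0.90, 0.88, 0.86, 0.79, 0.74, 0.70, 0.62]

report = acceptance_drift(baseline_scores, recent_scores)
print(report)
```

A flagged drift is not proof of miscalibration, since the traffic mix can change for benign reasons. It is a prompt to re-check the threshold against freshly labeled calls.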
