The Transcription Confidence Score Your Agent Trusted After the Vendor's Recalibration
The voice agent had a gate. Anything above 0.85 transcription confidence went straight to the planning step; anything below got routed to a human. The threshold had been tuned six months earlier against a labeled corpus of real customer calls, frozen into a config file, and forgotten. For six months it did exactly what it was supposed to do. Then the transcription provider shipped a model upgrade (same API, same response shape, same latency band, same documented accuracy), and over the next two weeks the agent started authorizing wire transfers to the wrong people.
"Transfer $50 to mom" became "transfer $5,000 to Tom." The new model returned that transcript with a confidence of 0.91, well above the gate. The downstream planner saw a confident transcript and acted on it. A customer's appeal eventually surfaced the bug, but by then the support queue had already triaged a week's worth of similar incidents as fraud disputes. The post-mortem traced the failure to a single decision the team had never made explicitly: that 0.85 from the old model and 0.85 from the new model were the same number.

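One way to make that decision explicit is to store the threshold alongside the model version it was tuned against, and to fail closed when the version is unfamiliar. The sketch below is illustrative only: the `model_version` field and the version strings `asr-v1` and `asr-v2` are hypothetical, and it assumes the provider exposes some version identifier in its response, which not every vendor does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transcript:
    text: str
    confidence: float
    # Hypothetical field: assumes the provider reports which model produced the score.
    model_version: Optional[str]


# Each threshold is valid only for the model version it was tuned against.
# "asr-v1" is a made-up identifier standing in for the old model.
CALIBRATED_THRESHOLDS = {"asr-v1": 0.85}


def route(transcript: Transcript) -> str:
    """Return "planner" or "human" for a transcript."""
    threshold = CALIBRATED_THRESHOLDS.get(transcript.model_version)
    if threshold is None:
        # Unknown or missing version: the score has no calibrated meaning here,
        # so fail closed and send the call to a person.
        return "human"
    # Strictly above the threshold goes to the planner; a score exactly at the
    # threshold is treated conservatively and goes to a human.
    return "planner" if transcript.confidence > threshold else "human"


if __name__ == "__main__":
    old = Transcript("transfer $50 to mom", 0.91, "asr-v1")
    new = Transcript("transfer $5,000 to Tom", 0.91, "asr-v2")
    print(route(old))  # planner
    print(route(new))  # human
```

Under these assumptions, the upgraded model's 0.91 would not have reached the planner until someone re-tuned a threshold for the new version against labeled calls and added it to the table. If the provider does not report a version at all, the same idea still applies, but the trigger has to come from somewhere else, such as monitoring the distribution of scores for a sudden shift.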