2 posts tagged with "asr"

The Transcription Confidence Score Your Agent Trusted After the Vendor's Recalibration

· 10 min read
Tian Pan
Software Engineer

The voice agent had a gate. Anything above 0.85 transcription confidence went straight to the planning step; anything below got routed to a human. The threshold had been tuned six months earlier against a labeled corpus of real customer calls, frozen into a config file, and forgotten. For six months it did exactly what it was supposed to do. Then the transcription provider shipped a model upgrade — same API, same response shape, same latency band, same documented accuracy — and over the next two weeks the agent started authorizing wire transfers to the wrong people.

"Transfer $50 to mom" became "transfer $5,000 to Tom." The new transcript came back with a confidence of 0.91, well above the gate. The downstream planner saw a confident transcript and acted on it. The customer's appeal eventually surfaced the bug, but by then the support queue had filtered out a week's worth of similar incidents as fraud disputes. The post-mortem traced the gap to a single decision the team had never made explicitly: that 0.85 from the old model and 0.85 from the new model were the same number.
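The gate described above can be sketched in a few lines. This is a minimal illustration, not the team's actual code: the `Transcript` type, the `model_version` field, the version labels `asr-v1` and `asr-v2`, and the `route` function are all hypothetical, and it assumes the caller can learn which model version produced a transcript. The one design choice it shows is keying the threshold to the model version it was tuned against, so that a score from an unvalidated model falls back to human review instead of being compared with a number tuned for a different model.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical config: each threshold is tied to the model version it was
# tuned against. "asr-v1" stands in for the model behind the original 0.85.
THRESHOLDS: Dict[str, float] = {"asr-v1": 0.85}


@dataclass
class Transcript:
    text: str
    confidence: float
    model_version: str  # assumption: the version is known to the caller


def route(transcript: Transcript) -> str:
    """Return "planner" or "human" for a transcript."""
    threshold = THRESHOLDS.get(transcript.model_version)
    if threshold is None:
        # No threshold was tuned for this model, so its scores are not
        # comparable to any number in the config. Fail closed.
        return "human"
    # Strictly above the threshold goes to the planner. The source does not
    # say what happens at exactly 0.85; this sketch sends ties to a human.
    return "planner" if transcript.confidence > threshold else "human"
```

With this shape, a 0.91 from the old model still reaches the planner, while the same 0.91 from an upgraded model goes to a human until someone re-tunes a threshold for it against labeled calls.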

Voice AI in Production: Engineering the 300ms Latency Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.