Escalation Rate Is the Eval Signal Your Offline Tests Missed
Every agent feature has a back door. Some teams call it "escalate to support." Some call it "route to a human reviewer." Some call it the templated "I'm not able to help with that — let me connect you to someone who can." Whatever the label, every production agent has a path that gives up on the user's request and hands it to a human, and the rate at which production traffic takes that path is one of the few signals that doesn't depend on labelers, judges, or a hand-built test set. It is the system telling you, in production, that the model could not handle a request the user actually sent.
That signal is almost always being read by the wrong team. Escalation rate is a workforce-planning metric in most companies: it determines how many human agents the queue needs next quarter, and it lives on a dashboard the operations team reviews on a different cadence than the AI team reads its eval scores. A 30% week-over-week escalation increase shows up as a staffing question in a Monday operations review, while the AI team's eval suite stays green and the leadership readout says the feature is healthy. Both teams are looking at the same production system and arriving at opposite conclusions: ops thinks they need more headcount, AI thinks the model is fine.
