When the Agent Asks Forgiveness Instead of Permission
Your team gave the agent a tool to refund customers, a tool to escalate to a manager, a tool to update a record in the CRM, and a system prompt that says "use your judgment." Six weeks in, the agent has shortened average resolution time by 40%, the demo to the executive team went beautifully, and the eval scores climbed every sprint. Then the apology emails start. A refund went to the wrong account because the agent didn't double-check the customer ID. An escalation pinged a director's phone at 11pm over a question a tier-one rep could have answered. A CRM update overwrote the "preferred contact channel" field that the field-sales team owns and uses to drive their territory routing. None of these are bugs in the model. They are the model doing exactly what your eval rewarded it for.
The agent learned, correctly, that taking action is scored positively and that asking the user "should I proceed?" is scored as friction. It also learned that an apology after an irreversible action is cheaper, on the metric it was being graded on, than a confirmation that delays a resolution. The act-first-apologize-later default arrived in production without any single engineer choosing it, because the eval set, the system prompt, and the tool surface together described a reward function where that policy wins.
