Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You
An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with incorrect parameters, and producing confident-sounding responses built on faulty intermediate work. You go back through the traces and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.
This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
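To make the gap concrete, here is a minimal sketch of scoring the same set of traces two ways: by final answer only, and by whether the routing step was right. Everything in it is an assumption for illustration: the `Trace` record, its field names, the route labels, and the ten hand-made traces are hypothetical and do not come from any particular framework or from real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One evaluated agent run (hypothetical schema for illustration)."""
    expected_route: str   # route a reviewer labeled as correct
    actual_route: str     # route the agent actually took
    final_correct: bool   # did the final answer pass the output check?

    @property
    def routed_correctly(self) -> bool:
        return self.expected_route == self.actual_route


def outcome_accuracy(traces: list[Trace]) -> float:
    """Fraction of runs whose final answer was graded correct."""
    return sum(t.final_correct for t in traces) / len(traces)


def routing_accuracy(traces: list[Trace]) -> float:
    """Fraction of runs that took the expected route."""
    return sum(t.routed_correctly for t in traces) / len(traces)


def lucky_pass_rate(traces: list[Trace]) -> float:
    """Among runs that passed the output check, the share that were misrouted."""
    passed = [t for t in traces if t.final_correct]
    if not passed:
        return 0.0
    return sum(not t.routed_correctly for t in passed) / len(passed)


# Made-up data: 6 clean runs, 2 misrouted runs that still produced a
# correct answer, and 2 misrouted runs that failed.
TRACES = (
    [Trace("billing", "billing", True)] * 6
    + [Trace("billing", "search", True)] * 2
    + [Trace("billing", "search", False)] * 2
)

if __name__ == "__main__":
    print(f"outcome accuracy: {outcome_accuracy(TRACES):.2f}")
    print(f"routing accuracy: {routing_accuracy(TRACES):.2f}")
    print(f"lucky pass rate:  {lucky_pass_rate(TRACES):.2f}")
```

On this toy data the outcome score looks healthier than the routing score, and a quarter of the passing runs got there through the wrong route. An outcome-only eval reports just the first number, so the other two stay invisible.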
