The Eval-to-Production Gap: Why 92% on Your Test Suite Means 40% User Satisfaction
Your team spent three weeks building a rigorous eval suite. It covers edge cases. It includes adversarial examples. The LLM-as-judge scores 92% across all dimensions. You ship.
Then the support tickets start. Users say the AI "doesn't understand what they're asking." Session abandonment is up 30%. Satisfaction scores come back at 41%.
This gap between eval performance and real-world outcomes is the most common failure mode in production AI systems today. It's not a model problem. It's a measurement problem.
