LLM-as-Classifier in Production: Why Accuracy Is the Wrong Metric
A team ships an LLM-based intent classifier. Evaluation accuracy: 94%. Two weeks into production, support volume is up 30%. The model isn't failing to classify; it's routing edge cases to the wrong queue with high confidence. Nobody built a circuit breaker for "the model is wrong and doesn't know it." The 94% figure never surfaced that risk.
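What would such a circuit breaker look like? It can't key off the model's own confidence, since the failure mode here is high-confidence errors; it has to watch an external signal, like how often agents re-route tickets the classifier placed. Here's a minimal sketch of that idea. The class name, the 500-ticket window, and the 15% trip rate are illustrative assumptions, not anything the team in the anecdote built:

```python
from collections import deque

class MisrouteCircuitBreaker:
    """Trips on downstream corrections (agent re-routes), not on model
    confidence, because the failure mode is 'confidently wrong'."""

    def __init__(self, window: int = 500, trip_rate: float = 0.15):
        # Rolling window of outcomes: True = an agent had to re-route the ticket.
        self.outcomes = deque(maxlen=window)
        self.trip_rate = trip_rate

    def record(self, was_misrouted: bool) -> None:
        self.outcomes.append(was_misrouted)

    def tripped(self) -> bool:
        # Wait for a full window before trusting the rate.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) >= self.trip_rate

def route(ticket, classify, breaker: MisrouteCircuitBreaker) -> str:
    # While tripped, bypass the model entirely and send tickets to
    # human triage until the misroute rate recovers.
    if breaker.tripped():
        return "human_triage"
    return classify(ticket)
```

The design choice that matters: the breaker consumes a production feedback signal, so it catches exactly the case the 94% holdout number can't, a model that keeps answering without knowing it's wrong.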
This failure pattern repeats across content moderation pipelines, routing systems, and entity extractors. The LLM gets a high score on the holdout set. The team ships. Something breaks quietly in production.
The issue isn't that accuracy is a bad metric. It's that accuracy answers the wrong question. Production classification has a different set of requirements, and most evaluation pipelines don't test for them.
