LLM-as-Classifier in Production: Why Accuracy Is the Wrong Metric

· 11 min read
Tian Pan
Software Engineer

A team ships an LLM-based intent classifier. Evaluation accuracy: 94%. Two weeks into production, support volume is up 30% — not because the model is failing to classify, but because it's routing edge cases to the wrong queue with very high confidence. Nobody built a circuit breaker for "the model is wrong and doesn't know it." The 94% figure never surfaced that risk.

This failure pattern repeats across content moderation pipelines, routing systems, and entity extractors. The LLM gets a high score on the holdout set. The team ships. Something breaks quietly in production.

The issue isn't that accuracy is a bad metric. It's that accuracy answers the wrong question. Production classification has a different set of requirements, and most evaluation pipelines don't test for them.

The Four Constraints That Accuracy Ignores

When you use an LLM as a classifier in production, you're asking it to do something different from generating text. You need consistent input-to-label mapping, predictable latency, reliable confidence estimates, and stable behavior under distribution shift. Accuracy measures the first constraint in isolation and ignores the rest.

Calibration: A model is calibrated if its stated confidence matches its actual accuracy. If a model says it's 90% confident, it should be correct 90% of the time. Research on production LLM classifiers has found that GPT-4o-mini's errors cluster at the high-confidence end — roughly 66.7% of misclassifications happen when the model reports confidence above 80%. For automated routing systems that use confidence as a decision gate, this is catastrophic: the model fails hardest exactly where the system trusts it most.
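A quick way to surface this in your own system is to bucket a labeled sample of production predictions by the model's reported confidence and compare each bucket against its observed accuracy. A minimal sketch, assuming you have logged (confidence, was_correct) pairs from hand-labeled traffic:

```python
from collections import defaultdict

def reliability_table(records, bin_width=0.1):
    """Group (confidence, was_correct) pairs into confidence bins
    and print the observed accuracy for each bin.

    records: iterable of (confidence in [0, 1], bool) pairs from a
    hand-labeled sample of production traffic.
    """
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for confidence, was_correct in records:
        idx = min(int(confidence / bin_width), n_bins - 1)
        bins[idx].append(was_correct)
    for idx in sorted(bins):
        hits = bins[idx]
        lo, hi = idx * bin_width, (idx + 1) * bin_width
        print(f"confidence {lo:.1f}-{hi:.1f}: n={len(hits):5d}  "
              f"observed accuracy={sum(hits) / len(hits):.2f}")
```

For a calibrated model, the observed accuracy in each row tracks the bin's confidence range. A row like `confidence 0.9-1.0: observed accuracy=0.71` is exactly the failure mode above: errors concentrated where automated routing trusts the model most.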

Per-class performance: Overall accuracy hides class-level failure. On a dataset with a 95%/5% class split, a classifier that always predicts the majority class achieves 95% accuracy with 0% recall on the minority class. In practice, the minority class is usually what you care about most — rare but harmful content, low-frequency intents with high business impact, unusual entity types. Per-class F1 scores, not overall accuracy, reveal whether the model actually works for your use case.
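The degenerate case is easy to reproduce. A toy sketch using scikit-learn (an assumption; any metrics library works) with a 95%/5% split and a classifier that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# 95%/5% class split: 950 "benign" items, 50 "harmful" ones
y_true = np.array([0] * 950 + [1] * 50)
# Degenerate classifier: always predict the majority class
y_pred = np.zeros_like(y_true)

print(f"accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 95.00%
print(classification_report(y_true, y_pred,
                            target_names=["benign", "harmful"],
                            zero_division=0))
# The "harmful" row reports precision 0.00, recall 0.00, f1-score 0.00
```

The headline number says 95%; the per-class report says the model never catches the one class the pipeline exists to catch.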

Throughput and latency: Classification in production has SLOs that general LLM benchmarks don't measure. Intent detection in a conversational system needs to complete in under 200ms. Content moderation in a streaming pipeline needs to process thousands of items per minute. LLM inference runs orders of magnitude slower than a traditional logistic regression classifier. The relevant question isn't just "is it accurate?" but "is it accurate at this latency, at this throughput, at this cost?"

Distribution stability: A model evaluated on a January holdout set may behave differently on June traffic if the input distribution has shifted. Model providers push silent updates. User behavior changes. Topics trend. Accuracy measured at a point in time says nothing about how the classifier will hold up six months later.
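One lightweight guard is to monitor the distribution of predicted labels over time and alarm when it drifts from the distribution at evaluation time. The population stability index is one standard drift measure (my choice here, not something prescribed above); a minimal sketch:

```python
import numpy as np

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two label distributions.

    baseline_counts / current_counts: per-class prediction counts,
    e.g. the January holdout set vs. this week's traffic.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted.
    """
    p = np.asarray(baseline_counts, dtype=float)
    q = np.asarray(current_counts, dtype=float)
    p = p / p.sum() + eps  # normalize to probabilities; eps avoids log(0)
    q = q / q.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

# Hypothetical counts: January evaluation vs. June traffic, per class
print(f"PSI: {psi([700, 250, 50], [520, 310, 170]):.3f}")  # ~0.21: drifting
```

This won't tell you accuracy has dropped, but it tells you the January evaluation no longer describes the traffic you're serving, which is the cue to re-label a fresh sample.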

Calibration: The Confidence Problem

The core difficulty with LLM calibration is that the confidence mechanism is architecturally separate from the classification mechanism. LLMs are trained to generate plausible text, not to estimate their own uncertainty reliably. When you ask a model to return a label and a confidence score, the confidence score is itself generated text: a number that looks like a plausible confidence statement for that answer, not a measured probability that the label is correct.

The practical consequence: you cannot use raw LLM confidence scores as thresholds for automated decisions in high-stakes applications without first calibrating them against real production data. Expected Calibration Error (ECE) values above 0.10 are common in production deployments — meaning that a model claiming 85% confidence might only be correct 70% of the time.
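ECE itself is cheap to compute once you have that labeled sample: bin predictions by reported confidence, then take the weighted average gap between each bin's mean confidence and its observed accuracy. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |observed accuracy - mean confidence| over bins.

    confidences: model-reported confidences in [0, 1]
    correct: booleans, True where the prediction matched the label
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a confidence bin
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece
```

Run it on a labeled slice of real traffic, not the evaluation set: calibration measured on the holdout can look fine while production ECE sits above 0.10.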

There are workable mitigations. Temperature scaling adjusts the sharpness of output distributions post-hoc, without retraining. Platt scaling fits a sigmoid to the model's raw outputs, converting them into calibrated probabilities. Ensemble approaches — pooling predictions from multiple models or multiple sampling runs — improve calibration by averaging out individual model overconfidence. None of these are free, but all of them are cheaper than shipping a pipeline where the confidence scores you're routing on are systematically misleading.
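Platt scaling is the lightest of these to retrofit, since it only needs the (confidence, was_correct) sample you already collected for ECE: fit a one-dimensional logistic regression mapping raw confidence to the empirical probability of being correct. A sketch on synthetic stand-in data (a real deployment would fit on labeled production traffic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a labeled sample: the model reports high confidence,
# but the true probability of being correct is systematically lower.
raw_conf = rng.uniform(0.5, 1.0, size=2000)
correct = rng.random(2000) < (0.4 + 0.4 * raw_conf)

# Platt scaling: 1-D logistic regression from raw score to P(correct)
calibrator = LogisticRegression().fit(raw_conf.reshape(-1, 1), correct)
calibrated = calibrator.predict_proba(raw_conf.reshape(-1, 1))[:, 1]

print(f"mean raw confidence:      {raw_conf.mean():.2f}")   # ~0.75
print(f"observed accuracy:        {correct.mean():.2f}")    # ~0.70
print(f"mean calibrated estimate: {calibrated.mean():.2f}") # ~0.70
```

Whatever threshold gates your automated routing should then be applied to the calibrated output, never the raw score.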

The practical floor for using confidence-based routing in production: evaluate ECE on a labeled sample from your actual traffic, not the evaluation set. Then either calibrate, or redesign the routing to not depend on confidence scores at all.

Latency and Throughput Are First-Class Constraints

The latency profile of LLM inference matters differently for classification than for generation. In generation, you can stream tokens, so latency is perceived across the full response. In classification, you need the complete label before you can act: every millisecond until the final token of the label arrives is blocking.

For real-time applications — routing an incoming request, deciding whether to flag content before displaying it, detecting intent in a voice interface — you're typically working with latency budgets under 500ms. LLM API calls, including network transit and inference, routinely take 800ms to 2 seconds for a single synchronous request. Without careful batching, caching, or model selection, this blows the budget.

The LLM inference trilemma applies here: you can optimize for throughput (more requests per second), latency (faster individual responses), or cost (cheaper per token), but pushing hard on any one typically degrades the other two. For classification workloads, the right trade-off depends on your traffic pattern. Bulk moderation pipelines with no real-time SLO can batch aggressively and tolerate higher latency. Customer-facing intent detection cannot.

Practical approaches that work in production:

  • Semantic caching: cache classification results for high-frequency inputs. Production systems typically see 20–40% cache hit rates, which reduces both latency and cost (a minimal sketch follows this list).
  • Model routing: send simple, high-confidence inputs to smaller, faster models (7B or smaller); escalate uncertain or complex inputs to larger models. This alone can reduce inference costs by 30–60%.
  • Quantized inference: 4-bit quantized models run 2–4x faster with 95–99% quality retention on most classification tasks.
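To make the first item concrete, here is a minimal semantic cache sketch. The `embed` function, the `llm_classify` call, and the 0.95 similarity threshold are all assumptions to be tuned against your own traffic; a production version would also swap the linear scan for an ANN index:

```python
import numpy as np

class SemanticCache:
    """Cache classification results keyed by input embedding."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # any text -> fixed-size vector function
        self.threshold = threshold  # min cosine similarity for a cache hit
        self.vectors = []           # embeddings of cached inputs
        self.labels = []            # their classification results

    def get(self, text):
        if not self.vectors:
            return None
        v = self.embed(text)
        mat = np.stack(self.vectors)
        # Cosine similarity against every cached embedding (linear scan)
        sims = mat @ v / (np.linalg.norm(mat, axis=1) * np.linalg.norm(v))
        best = int(np.argmax(sims))
        return self.labels[best] if sims[best] >= self.threshold else None

    def put(self, text, label):
        self.vectors.append(self.embed(text))
        self.labels.append(label)

def classify(text, cache, llm_classify):
    cached = cache.get(text)
    if cached is not None:
        return cached               # cache hit: no LLM call, no added latency
    label = llm_classify(text)      # the slow, expensive path
    cache.put(text, label)
    return label
```

The threshold is the risky knob: set it too low and near-duplicates with different correct labels start sharing cached results.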

Per-Class Testing and the Confusion Matrix Requirement

An accuracy figure without a confusion matrix is an incomplete evaluation. Before promoting an LLM classifier to production, you need the full breakdown: for each class, how often does the model correctly identify it, misidentify it as something else, or miss it entirely?

The confusion matrix surfaces failure patterns that matter for operational decisions. If your content moderation classifier labels "harassment" as "spam", both count as flagged content, but they route to different queues with different human review processes. If your intent classifier misroutes "billing dispute" as "technical support", the user gets transferred and frustrated even though the classification was "almost right."
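A sketch of what that breakdown looks like, using scikit-learn's confusion_matrix on hypothetical intent labels:

```python
from sklearn.metrics import confusion_matrix

LABELS = ["billing_dispute", "technical_support", "account"]

# Hypothetical hand-reviewed sample of 50 production requests
y_true = (["billing_dispute"] * 20 + ["technical_support"] * 20
          + ["account"] * 10)
y_pred = (["billing_dispute"] * 14 + ["technical_support"] * 6
          + ["technical_support"] * 20
          + ["account"] * 9 + ["billing_dispute"] * 1)

cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: true labels
header = "".join(f"{name:>18s}" for name in LABELS)
print(f"{'':18s}{header}")
for name, row in zip(LABELS, cm):
    print(f"{name:18s}" + "".join(f"{n:18d}" for n in row))
# Overall accuracy is 86% (43/50), but the billing_dispute row shows
# 6 of 20 requests misrouted to technical_support -- the exact failure
# the accuracy number hides.
```

The off-diagonal cells, not the diagonal sum, tell you which queues will absorb the errors and which review processes will see misrouted work.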
