
Why '92% Accurate' Is Almost Always a Lie

· 8 min read
Tian Pan
Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.

What a Single Accuracy Number Hides

A model that achieves 92% accuracy on a balanced dataset has an 8% error rate. That sounds manageable. But that 8% is not distributed uniformly, and the headline number tells you nothing about:

  • Which errors are harmful vs. recoverable. A wrong product recommendation that the user ignores is not the same as a wrong fraud flag that freezes a legitimate account.
  • Where errors concentrate. A 92% aggregate figure can mask a 60% error rate on your highest-value user segment if that segment is a small fraction of your test set.
  • What the model does when uncertain. Some models output a confident wrong answer; others abstain or defer. These behave very differently in production.
  • Whether the error rate is stable. A model that's 92% accurate on your historical data may degrade sharply on new input distributions without any obvious signal.

The Apple Card credit algorithm controversy is a well-documented case of a model that performed well on aggregate metrics while systematically underserving a specific demographic. The headline number was fine. The distribution beneath it was not.

A Four-Dimension Error Taxonomy That Drives Real Decisions

When briefing non-technical stakeholders, stop presenting a single accuracy number. Instead, classify the model's outputs into four buckets:

1. Correct. The model predicted the right thing and the user got value from it. This is the numerator you want to grow.

2. Wrong-but-recoverable. The model made an error, but the downstream consequence is low-cost to fix. A misclassified support ticket gets rerouted within minutes. A wrong product suggestion gets ignored. These errors matter for user experience, but they rarely drive legal or financial exposure.

3. Wrong-and-harmful. The model made an error with a significant downstream consequence. A missed fraud signal that costs money. A wrong medical triage classification. A biased credit decision. These errors have asymmetric cost — one harmful error can outweigh hundreds of correct predictions in terms of actual impact.

4. Abstained. The model declined to predict or routed to a human reviewer. This is often a feature, not a failure — a well-calibrated model that knows its limits produces better outcomes than one that confidently answers everything. Tracking abstention rate separately tells you how much of your traffic the model is actually handling.

This taxonomy forces a conversation that "92% accurate" never does. Stakeholders who cannot distinguish between recoverable and harmful errors will make systematically wrong tradeoffs — they will either over-invest in reducing errors that don't matter much, or under-invest in fixing errors that are genuinely damaging.
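The taxonomy is simple enough to implement directly. Here is a minimal sketch; the `Prediction` record, the `harmful_if_wrong` flag, and the example data are illustrative assumptions, not a prescribed schema:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical prediction record for bucketing model outputs.
@dataclass
class Prediction:
    label: Optional[str]    # None means the model abstained / deferred
    truth: str
    harmful_if_wrong: bool  # does an error here carry asymmetric downstream cost?

def bucket(p: Prediction) -> str:
    """Classify one output into the four-bucket taxonomy."""
    if p.label is None:
        return "abstained"
    if p.label == p.truth:
        return "correct"
    return "wrong_and_harmful" if p.harmful_if_wrong else "wrong_but_recoverable"

# Illustrative examples, one per bucket.
predictions = [
    Prediction("fraud", "fraud", True),        # caught fraud: correct
    Prediction("legit", "fraud", True),        # missed fraud: wrong-and-harmful
    Prediction("billing", "shipping", False),  # misrouted ticket: recoverable
    Prediction(None, "fraud", True),           # routed to a human: abstained
]

counts = Counter(bucket(p) for p in predictions)
print(dict(counts))
```

Note that `harmful_if_wrong` is a product and legal judgment, not a model property; the whole point of the taxonomy is to force that judgment to be made explicitly, per output type, before launch.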

Accuracy Distributes Across User Segments, Not Just Test Sets

Most model evaluations aggregate across the entire test set. This is the right starting point, but it is a dangerous ending point.

The users who generate the most edge cases for your model are rarely proportionally represented in historical training or test data. A language model fine-tuned on English-heavy data will underperform for users writing in mixed-language contexts. A fraud detection model trained on desktop web sessions will have higher error rates on mobile-first users from markets that onboarded later.

Before presenting accuracy to stakeholders, segment it across at least two axes:

  • User cohorts that matter to the business. High-LTV customers, users in regulated markets, recently onboarded users with sparse history.
  • Input difficulty. Short vs. long inputs, ambiguous vs. unambiguous requests, high-confidence vs. low-confidence model outputs.

A model that performs at 95% on easy inputs and 60% on hard inputs is telling you something critical about where to invest. An aggregate of 88% tells you almost nothing about where to look.
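Segmenting along both axes is a few lines of code once each evaluation record carries the segment labels. A minimal sketch, with hypothetical field names and made-up data:

```python
from collections import defaultdict

# Illustrative evaluation records; "cohort" and "difficulty" are the two
# segmentation axes, and the data is fabricated for the example.
records = [
    {"cohort": "high_ltv", "difficulty": "easy", "correct": True},
    {"cohort": "high_ltv", "difficulty": "hard", "correct": False},
    {"cohort": "high_ltv", "difficulty": "hard", "correct": False},
    {"cohort": "high_ltv", "difficulty": "easy", "correct": True},
    {"cohort": "new_user", "difficulty": "easy", "correct": True},
    {"cohort": "new_user", "difficulty": "easy", "correct": True},
    {"cohort": "new_user", "difficulty": "hard", "correct": True},
    {"cohort": "new_user", "difficulty": "easy", "correct": True},
]

by_segment = defaultdict(lambda: [0, 0])  # (axis, value) -> [correct, total]
for r in records:
    for axis in ("cohort", "difficulty"):
        stats = by_segment[(axis, r[axis])]
        stats[0] += int(r["correct"])
        stats[1] += 1

overall = sum(int(r["correct"]) for r in records) / len(records)
print(f"aggregate: {overall:.0%}")
for (axis, value), (c, n) in sorted(by_segment.items()):
    print(f"{axis}={value}: {c}/{n} = {c / n:.0%}")
```

Even on this toy data the aggregate (75%) hides the actionable facts: the high-LTV cohort sits at 50% and hard inputs at 33%. Those are the numbers that belong in the stakeholder briefing.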
