Why '92% Accurate' Is Almost Always a Lie
You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.
Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.
The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.
What a Single Accuracy Number Hides
A model that achieves 92% accuracy on a balanced dataset has an 8% error rate. That sounds manageable. But that 8% is not distributed uniformly, and the headline number tells you nothing about:
- Which errors are harmful vs. recoverable. A wrong product recommendation that the user ignores is not the same as a wrong fraud flag that freezes a legitimate account.
- Where errors concentrate. A 92% aggregate figure can mask a 60% error rate on your highest-value user segment if that segment is a small fraction of your test set.
- What the model does when uncertain. Some models output a confident wrong answer; others abstain or defer. These behave very differently in production.
- Whether the error rate is stable. A model that's 92% accurate on your historical data may degrade sharply on new input distributions without any obvious signal.
The Apple Card credit algorithm controversy is a well-documented case of a model that performed well on aggregate metrics while systematically underserving a specific demographic. The headline number was fine. The distribution beneath it was not.
A Four-Dimension Error Taxonomy That Drives Real Decisions
When briefing non-technical stakeholders, stop presenting a single accuracy number. Instead, classify the model's outputs into four buckets:
1. Correct. The model predicted the right thing and the user got value from it. This is the denominator you want to grow.
2. Wrong-but-recoverable. The model made an error, but the downstream consequence is low-cost to fix. A misclassified support ticket gets rerouted within minutes. A wrong product suggestion gets ignored. These errors matter for user experience, but they rarely drive legal or financial exposure.
3. Wrong-and-harmful. The model made an error with a significant downstream consequence. A missed fraud signal that costs money. A wrong medical triage classification. A biased credit decision. These errors have asymmetric cost — one harmful error can outweigh hundreds of correct predictions in terms of actual impact.
4. Abstained. The model declined to predict or routed to a human reviewer. This is often a feature, not a failure — a well-calibrated model that knows its limits produces better outcomes than one that confidently answers everything. Tracking abstention rate separately tells you how much of your traffic the model is actually handling.
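The four buckets above can be tallied directly from an evaluation set. A minimal sketch in Python; the `harmful_labels` set (which true labels carry real downstream cost) is a domain-specific assumption you must define, and the function names are illustrative:

```python
from collections import Counter
from enum import Enum

class Bucket(Enum):
    CORRECT = "correct"
    RECOVERABLE = "wrong-but-recoverable"
    HARMFUL = "wrong-and-harmful"
    ABSTAINED = "abstained"

def classify(prediction, label, harmful_labels):
    """Assign one evaluation example to a bucket.

    harmful_labels: the set of true labels where a miss carries real
    downstream cost (fraud, triage, credit). Defining this set is a
    product/legal decision, not a modeling one.
    """
    if prediction is None:          # model declined / routed to a human
        return Bucket.ABSTAINED
    if prediction == label:
        return Bucket.CORRECT
    return Bucket.HARMFUL if label in harmful_labels else Bucket.RECOVERABLE

def bucket_rates(examples, harmful_labels):
    """examples: (prediction, true_label) pairs; returns rate per bucket."""
    counts = Counter(classify(p, y, harmful_labels) for p, y in examples)
    total = sum(counts.values())
    return {b.value: counts[b] / total for b in Bucket}
```

The point of the sketch is the shape of the report, not the ten lines of code: every evaluation run produces four rates, not one.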
This taxonomy forces a conversation that "92% accurate" never does. Stakeholders who cannot distinguish between recoverable and harmful errors will make systematically wrong tradeoffs — they will either over-invest in reducing errors that don't matter much, or under-invest in fixing errors that are genuinely damaging.
Accuracy Distributes Across User Segments, Not Just Test Sets
Most model evaluations aggregate across the entire test set. This is the right starting point, but it is a dangerous ending point.
The users who generate the most edge cases for your model are rarely proportionally represented in historical training or test data. A language model fine-tuned on English-heavy data will underperform for users writing in mixed-language contexts. A fraud detection model trained on desktop web sessions will have higher error rates on mobile-first users from markets that onboarded later.
Before presenting accuracy to stakeholders, segment it across at least two axes:
- User cohorts that matter to the business. High-LTV customers, users in regulated markets, recently onboarded users with sparse history.
- Input difficulty. Short vs. long inputs, ambiguous vs. unambiguous requests, high-confidence vs. low-confidence model outputs.
A model that performs at 95% on easy inputs and 60% on hard inputs is telling you something critical about where to invest. An aggregate of 88% tells you almost nothing about where to look.
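A minimal sketch of the segmentation step, assuming each evaluation row already carries a segment tag; returning the sample count alongside accuracy keeps small cohorts from being read with false confidence:

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """rows: iterable of (segment, is_correct) pairs.
    Returns {segment: (accuracy, n)} so cohort size is always visible."""
    totals = defaultdict(lambda: [0, 0])   # segment -> [correct, n]
    for segment, is_correct in rows:
        totals[segment][0] += int(is_correct)
        totals[segment][1] += 1
    return {seg: (correct / n, n) for seg, (correct, n) in totals.items()}
```

Running this on a set with 19/20 correct on easy inputs and 6/10 on hard inputs reports 95% and 60% respectively, while the aggregate (25/30, about 83%) would have hidden the split entirely.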
The One-Page Format That Prevents Miscalibration
The goal of communicating accuracy to non-technical stakeholders is not to make them feel good about a number. It is to give them exactly enough information to make correct product, legal, and investment decisions. That requires a format that is dense but not technical.
A useful one-page accuracy brief contains:
The headline: Overall accuracy on the current evaluation set, with a comparison to the previous version and a clear note on what improved or regressed.
Error breakdown: How many errors fall into each of the four buckets — correct, wrong-but-recoverable, wrong-and-harmful, abstained — expressed as percentages and as absolute counts projected to your actual traffic volume. Absolute counts matter; "0.5% harmful errors" lands differently than "~500 harmful decisions per day at current traffic."
Segment performance: Accuracy for the two or three user cohorts that are most important to the business or most likely to be affected by errors. If a segment has meaningfully different error rates, say so explicitly.
What the model does not handle: Every model has a defined input domain. When inputs fall outside that domain, behavior is not guaranteed: a well-designed system abstains or degrades gracefully, but without explicit handling the model can also fail confidently. Stakeholders need to know what the boundaries are, not just how the model performs inside them.
The threshold that was chosen and why: Most classification models have a decision threshold — the confidence score above which the model commits to a prediction rather than abstaining. Lowering the threshold increases recall at the cost of more false positives. Raising it does the opposite. Stakeholders who don't know a threshold was chosen cannot evaluate whether it was chosen correctly for their context.
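The threshold tradeoff can be made concrete with a small sketch, assuming each evaluation example carries a model confidence score; the names here are illustrative, not a standard API:

```python
def coverage_and_accuracy(examples, threshold):
    """examples: (prediction, confidence, true_label) triples.
    The model commits when confidence >= threshold, else abstains.
    Returns (coverage, accuracy-on-committed): raising the threshold
    shrinks coverage but typically raises accuracy on what remains."""
    committed = [(p, y) for p, conf, y in examples if conf >= threshold]
    if not committed:
        return 0.0, None
    coverage = len(committed) / len(examples)
    accuracy = sum(p == y for p, y in committed) / len(committed)
    return coverage, accuracy
```

Sweeping `threshold` over a validation set and showing stakeholders the resulting coverage/accuracy pairs turns "the threshold was chosen" into a decision they can actually review.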
One concrete format: a table with rows for each stakeholder-relevant segment and columns for correct, recoverable-error, harmful-error, and abstained rates. Add a single row at the top for the aggregate. The whole thing fits on one page and answers every question a legal or product review should be asking.
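One way to sketch that table in Python, with the harmful rate also projected to absolute decisions per day as the brief recommends. The segment names and rates are illustrative, and the aggregate here is an unweighted mean across segments; a real brief should weight it by segment traffic volume:

```python
BUCKETS = ("correct", "recoverable", "harmful", "abstained")

def render_brief(segment_rates, daily_traffic):
    """segment_rates: {segment: {bucket: rate}}, rates as fractions.
    daily_traffic: real request volume, so the harmful rate also appears
    as an absolute count ("0.5%" lands differently than "~500/day").
    Aggregate row is an unweighted mean across segments (assumption)."""
    agg = {b: sum(r[b] for r in segment_rates.values()) / len(segment_rates)
           for b in BUCKETS}
    rows = [("ALL", agg)] + sorted(segment_rates.items())
    header = f"{'segment':<12}" + "".join(f"{b:>13}" for b in BUCKETS) \
             + f"{'harmful/day':>13}"
    lines = [header]
    for name, rates in rows:
        cells = "".join(f"{rates[b]:>13.1%}" for b in BUCKETS)
        lines.append(f"{name:<12}{cells}{rates['harmful'] * daily_traffic:>13.0f}")
    return "\n".join(lines)
```

The output is deliberately plain text: it pastes into a doc, a Slack message, or a legal review without anyone needing a notebook.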
Why "Explain It Simply" Is Not the Problem
Engineers often approach non-technical stakeholder communication as a problem of simplification: how do we make the math accessible? This framing gets the problem backwards.
The issue is not that stakeholders cannot understand precision-recall tradeoffs. The issue is that engineers frequently present information that is technically correct but strategically incomplete. A single accuracy figure is not a simplification of the full picture — it is a compression that loses the signal that matters most for decisions.
When a PM asks "how accurate is the model," they are not asking for a math lesson. They are asking: can I ship this, and if something goes wrong, what will I be responsible for? A number like 92% does not answer that question. A breakdown that shows "of the errors we make, fewer than 0.3% are in the harmful category, and those are concentrated in inputs the model flags as low-confidence" does answer it.
The goal is not to make the information simpler. It is to make it complete for the decision that needs to be made.
Accuracy Alone Cannot Tell You Whether to Ship
The final failure mode of headline metrics is that they are static. A model evaluated at 92% accuracy in November, trained on data through September, may be operating at 84% accuracy by the following March as input distributions shift — and that degradation will not be visible in the accuracy figure you presented at launch.
Production accuracy is a moving target. It decays as the world changes in ways your training data did not anticipate. New user behaviors, regulatory shifts, seasonal patterns, and distribution shifts in upstream data all erode model performance over time. A stakeholder who was told "92% accurate" at launch and was not told "this figure needs to be re-evaluated quarterly" has been given an incomplete picture.
The monitoring pattern that closes this gap is to instrument the four-bucket taxonomy in production, not just in evaluation. Track your harmful-error rate as an operational metric with an alert threshold, just as you would track p99 latency or error rate for a service. When the harmful-error rate crosses a threshold, that is a production incident — not a scheduled review agenda item.
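A minimal sketch of that monitoring pattern: a rolling window over recent production decisions, with the 0.4% alert threshold from the example below. Both the window size and the threshold are assumptions to tune to your actual traffic volume:

```python
from collections import deque

class HarmfulErrorMonitor:
    """Rolling-window alert on the production harmful-error rate.

    window: how many recent decisions to consider (assumed value).
    alert_rate: harmful-error fraction that constitutes an incident
    (0.4% mirrors the example in the text).
    """
    def __init__(self, window=10_000, alert_rate=0.004):
        self.outcomes = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, is_harmful: bool) -> bool:
        """Record one decision; return True if the alert should fire."""
        self.outcomes.append(bool(is_harmful))
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.alert_rate
```

In practice the same logic usually lives in your metrics stack as a counter plus an alert rule rather than in application code, but the contract is identical: the harmful-error rate is an operational metric with a pager attached.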
Non-technical stakeholders can reason about this if you give them the framing. "We have an alert that fires if the harmful-error rate exceeds 0.4% of traffic" is a sentence any product leader or lawyer can understand and hold engineering accountable to.
The Takeaway
When you present AI accuracy to non-technical stakeholders, you are not doing math communication. You are doing risk communication. The relevant questions are not "how often is the model right?" but "when it's wrong, who gets hurt, and can we detect it before they do?"
Build your accuracy brief around the four-bucket taxonomy, segment it for the cohorts that matter to your business, state the threshold you chose and why, and commit to a production monitoring number — not just an evaluation metric. That is the format that leads to correct product decisions, appropriate legal posture, and engineering accountability that holds up when something eventually goes wrong.
Because something will.
