The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust
A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."
This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.
The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.
Reliability Is Not a Single Number
The standard model of AI feature quality treats accuracy as monotonically good. Higher is better. The launch gate is "we got to N%, ship it." The dashboard reports a single number, the team optimizes against it, and decisions get made on the assumption that the curve from "bad" to "good" runs through the middle without changing shape.
It does not. There are at least three regimes, and the user experience inside each is qualitatively different.
In the low-accuracy regime (below ~50%), the user pattern is "verify everything, use the AI as a draft generator." The output is suggestion, not answer. Every result gets read with skepticism. The user retains their own judgment as the load-bearing component, and the AI is a labor-saver on the parts where they would have started from a blank page anyway. Failures here are cheap because the user was already going to check.
In the high-accuracy regime (above ~95%, sustained across the user's actual workload), the user pattern is "trust by default, sample-check periodically." The output is treated as answer, not suggestion. The rare failure is forgivable as a known edge case, the user learns the boundaries through experience, and trust is recoverable. Failures are visible but not destabilizing because the failure rate is low enough that the user's heuristic for "this might be wrong" still fires in roughly the right places.
In the uncanny zone (roughly 60–85%, depending on the domain and the visibility of failure), neither pattern is stable. Verification is too expensive to do every time when the answer is usually correct. Trust-by-default produces too many concentrated failures, and those failures are not randomly distributed — they cluster in the segments the model finds hard, which the user discovers only after being burned several times. The user oscillates between modes and ends up trusting neither the AI nor their own ability to judge when to trust it. This is the "I cannot tell when to trust this" failure, and it is the failure mode the aggregate accuracy number is least equipped to surface.
The shape of the curve matters more than the value at any single point. A team that does not know which regime its feature is in does not know whether shipping the next 5-point accuracy improvement will move the user experience up, down, or sideways.
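To make the regimes concrete, here is a minimal sketch of the classification, assuming the band boundaries above (the exact thresholds are domain-dependent, and the regime names are illustrative):

```python
def reliability_regime(accuracy: float) -> str:
    """Map accuracy measured on the realized workload (not the eval set)
    to one of the three regimes. Thresholds are illustrative assumptions."""
    if accuracy < 0.50:
        return "draft-generator"    # verify everything; output is a suggestion
    if accuracy >= 0.95:
        return "trust-by-default"   # treat as answer, sample-check periodically
    if 0.60 <= accuracy <= 0.85:
        return "uncanny-zone"       # neither verification pattern is stable
    return "transitional"           # 50-60% or 85-95%: treat as the nearest regime, cautiously
```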
Why the Aggregate Accuracy Number Lies to You
Even setting aside the non-monotonic shape, the headline accuracy number lies in two specific ways that compound the uncanny-valley problem.
The first lie is that the average hides the segment. A model that is 70% accurate on average is rarely 70% accurate everywhere. It is usually 90% accurate on the easy segment and 40% accurate on a harder one — and the harder one is disproportionately likely to be the segment the team's most demanding customers live in. A credit-decisioning model with 90% overall accuracy can still systematically under-score one demographic; an aggregate metric that looks fair on each attribute alone can fail badly at the intersection. In an LLM feature this looks like "the assistant works great in English on short queries and falls off a cliff in Spanish on long ones," but the leaderboard says 78%, so the team ships.
The right discipline here is per-segment accuracy. Not as a fairness afterthought — as the primary view, with the global average relegated to a summary line. A heatmap of error rates across attribute pairs reveals the patterns the aggregate hides. If a 50% segment is sitting behind a 90% segment, the team needs to know before the segment-50% users do.
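One way to make the worst slice the primary view, sketched with only the standard library and assuming each eval record carries its segment attributes (the attribute names here, language and query length, are illustrative):

```python
from collections import defaultdict

def per_segment_accuracy(records, keys=("language", "query_length")):
    """records: dicts with segment attributes plus a 'correct' boolean.
    Returns (segment, accuracy, sample_count), worst segment first."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        segment = tuple(r[k] for k in keys)
        totals[segment] += 1
        hits[segment] += int(r["correct"])
    return sorted(
        ((seg, hits[seg] / totals[seg], totals[seg]) for seg in totals),
        key=lambda row: row[1],   # ascending accuracy: the worst slice leads the report
    )

records = [
    {"language": "en", "query_length": "short", "correct": True},
    {"language": "es", "query_length": "long",  "correct": False},
    # ... sampled from the realized workload, not synthetic examples
]
for segment, acc, n in per_segment_accuracy(records):
    print(f"{segment}: {acc:.0%} over {n} samples")
```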
The second lie is that accuracy is measured on the eval set, not the workload. The eval suite is curated, filtered, and biased toward the queries the team thought to write down. The realized workload includes the long tail the eval suite missed: the user who pastes in 8K of context, the customer who phrased the question in a way no synthetic example covers, the case where the retriever returned the wrong document. The realized accuracy on the long tail is almost always worse than the eval-set number, and the user experience is set by the long tail, not by the median. The team that confuses "78% on the eval suite" with "78% in production" is making a category error.
Why Users Cannot Detect Miscalibration on Their Own
There is a tempting move at this point: if the problem is that the user cannot tell when to trust the output, push the responsibility for trust calibration onto the user. Tell them the model is "sometimes wrong" in the onboarding flow, add a disclaimer at the bottom of every response, and let them figure it out.
This does not work, and the research on automation bias is unambiguous about why. Users in human-AI collaboration consistently fail to detect miscalibration; they over-rely on overconfident AI and under-rely on underconfident AI, and they do this even when shown the system's confidence scores. The pattern called "automation complacency" — the tendency to stop checking once the AI has been right several times in a row — is a stable feature of how humans interact with partially-reliable systems, not a defect of training or attention. It generalizes from the autopilot literature directly to LLM features.
The product implication is that the user cannot be the trust-calibration mechanism. If the team ships a 70%-accurate feature with confident-sounding outputs and trusts the user to verify the right ones, the user will verify almost none of them, will be wrong on a concentrated subset, and will lose trust in the feature in a way that is very hard to recover. The trust-calibration mechanism has to live in the product, not in the user's head.
That mechanism has four moving parts, and they all have to be designed in before launch — not bolted on after the first round of churn.
Designing for the Uncanny Zone
The first part is confidence surfacing that the user can act on. Not a numeric score buried in a tooltip — those are almost universally ignored — but a UX pattern that makes "I am unsure about this" structurally different from "I am confident about this." A draft-mode banner. A different visual treatment for outputs the model flagged as low-confidence. An explicit "this answer used a source that may not be reliable" callout. The signal has to be loud enough to break the automation-complacency loop, which means it has to interrupt the user's default reading flow, not sit politely in a corner.
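A sketch of what "structurally different" might mean at the rendering layer, assuming the model (or the retrieval pipeline) exposes a confidence estimate and a source-reliability flag; the treatment names and the threshold are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RenderDecision:
    treatment: str         # a structural UX treatment, not a tooltip
    interrupts_flow: bool  # whether it breaks the user's default reading flow

def ux_signal(confidence: float, source_reliable: bool = True) -> RenderDecision:
    """Convert model-side certainty into a product-side treatment. The point is
    that low confidence changes the structure of the output, not a number in a corner."""
    if not source_reliable:
        return RenderDecision("unreliable-source-callout", interrupts_flow=True)
    if confidence < 0.7:   # illustrative threshold
        return RenderDecision("draft-banner", interrupts_flow=True)
    return RenderDecision("standard-answer", interrupts_flow=False)
```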
The second part is a default-to-assistance UX for any feature in the uncanny band. If the feature's measured accuracy is between 60% and 85% on the realized workload, the default mode is "draft for human review," not "answer." The user should have to click a button to commit the AI's output to a workflow, not have to click a button to review it. This inverts the cognitive load: the cost of inaction lands on the AI's output (it sits as a draft), not on the user's vigilance (they have to remember to check). The autonomy-spectrum literature calls this "human-in-the-loop with AI as second opinion" — humans make the primary decision while the AI alerts to potential issues, rather than the other way around.
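One way to encode that inversion, as a sketch: the AI's output is born as a draft object, and the only path into the downstream workflow is an explicit user action (the state names are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AiOutput:
    text: str
    state: str = "draft"                 # the default: inaction leaves it as a draft
    reviewed_by: Optional[str] = None

    def commit(self, reviewed_by: str) -> None:
        """Committing requires a deliberate user action; nothing downstream
        consumes an AiOutput unless state == 'committed'."""
        self.reviewed_by = reviewed_by
        self.state = "committed"
```

Forgetting to review then costs the AI its output, not the user their vigilance.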
The third part is per-segment dashboards that gate launches. The launch gate cannot be a single number. It has to include the worst-segment accuracy and the cardinality of users in that segment. A feature that is 78% globally and 45% on a segment that contains 8% of the user base is not ready, even if the global number meets the bar. The team needs to either lift that segment to baseline or scope the launch to exclude the segment, not ship a product whose failure mode is "the bottom 8% of users have a much worse experience and the support team will discover this from escalations." The Apple Card story is the canonical version of this failure outside LLMs; inside them, it tends to be locale, query length, or domain.
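A sketch of a launch gate that encodes this, carrying over the 45%/8% example from above (the thresholds are illustrative, not a recommendation):

```python
def launch_gate(global_acc, segments, min_global=0.78, min_segment=0.70):
    """segments: name -> (accuracy, share_of_users). Returns (ship?, reason)."""
    if global_acc < min_global:
        return False, "global accuracy below bar"
    name, (acc, share) = min(segments.items(), key=lambda kv: kv[1][0])
    if acc < min_segment:
        return False, (f"segment '{name}' is at {acc:.0%} and covers {share:.0%} of users: "
                       "lift it to baseline or scope the launch to exclude it")
    return True, "ok"

print(launch_gate(0.78, {"en-short": (0.90, 0.55), "es-long": (0.45, 0.08)}))
```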
The fourth part is a launch discipline that explicitly classifies features by reliability regime. A feature targeted for the high-accuracy regime ships when it crosses 95% on the realized workload, with the worst segment above ~85%. A feature targeted for the assisted regime ships when its UX is designed for "draft and review" and its accuracy is high enough to make the draft useful (often 50–70% is plenty for this). A feature in between — 70% accuracy with a UX designed for "answer" — is the configuration to refuse to ship. Either the accuracy needs to come up or the UX needs to drop down, but the combination is the one that creates the trust collapse.
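Written as a check, that discipline looks something like this sketch, assuming features declare one of two UX modes, "answer" or "assisted" (the numbers restate the bands above and are assumptions, not standards):

```python
def ship_decision(realized_accuracy: float, worst_segment_accuracy: float,
                  ux_mode: str) -> str:
    """Refuse the configuration that creates the trust collapse:
    uncanny-zone accuracy presented with an answer-mode UX."""
    if ux_mode == "answer":
        if realized_accuracy >= 0.95 and worst_segment_accuracy >= 0.85:
            return "ship"
        return "block: answer-mode UX needs high-regime accuracy (raise accuracy or drop to assisted UX)"
    if ux_mode == "assisted":
        if realized_accuracy >= 0.50:
            return "ship"   # a draft-and-review UX tolerates much lower accuracy
        return "block: too inaccurate to be a useful draft"
    return "block: feature has no declared reliability regime"

print(ship_decision(0.70, 0.45, "answer"))    # blocked: the uncanny-zone configuration
print(ship_decision(0.70, 0.45, "assisted"))  # ships: same model, review-first UX
```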
What This Looks Like in Practice
The teams who ship well in the uncanny zone are recognizable by a few habits.
They run a per-tenant or per-segment accuracy dashboard that surfaces the worst slice prominently, not buried under the global average. The eval suite is curated from realized workload (sampled production traffic) rather than synthetic examples, so the eval number tracks what users actually see. Confidence scores from the model are converted into UX signals at the boundary — by the rendering layer, not the model — so the product surface reflects the model's certainty rather than asking the user to read a probability and translate it.
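A sketch of the sampling habit, assuming production traces carry a segment label (the field names and the per-segment sample size are assumptions):

```python
import random
from collections import defaultdict

def sample_eval_set(production_traces, per_segment=50, seed=0):
    """Draw the eval suite from realized traffic, stratified by segment, so the
    long tail is represented and the eval number tracks what users actually see."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for trace in production_traces:          # trace: {"segment": ..., "input": ..., ...}
        by_segment[trace["segment"]].append(trace)
    eval_set = []
    for segment, traces in by_segment.items():
        eval_set.extend(rng.sample(traces, min(per_segment, len(traces))))
    return eval_set
```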
They classify every AI feature by its target reliability regime in writing, and the launch gate references that classification. A feature labeled "assisted" can ship at lower accuracy because its UX is designed for review; a feature labeled "autonomous" cannot ship until the worst segment crosses the bar. The product manager's accountability is for the regime claim, not for hitting an arbitrary accuracy target — because shipping a feature whose accuracy claim does not match its UX claim is the failure they are trying to prevent.
They watch experiential metrics, not just output metrics. The number that tells you a feature is in trust collapse is not accuracy; it is something like "fraction of users who reverted from accepting AI output to ignoring it within their first ten interactions" or "rate of explicit thumbs-down on outputs the eval pipeline scored as correct." The users were trying to tell the team something the eval pipeline could not see, and the team that does not instrument for that signal will hear it first from churn.
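A sketch of the first of those signals, the reversion rate, assuming an event log of each user's earliest interactions with an accepted/ignored flag (the event shape and the halfway cutoff are assumptions; the ten-interaction window is the example above):

```python
from collections import defaultdict

def reversion_rate(events, window=10):
    """Fraction of users who accepted AI output early in their first `window`
    interactions and then stopped accepting for the rest of that window:
    a rough proxy for 'tried it, got burned, now ignores it'."""
    by_user = defaultdict(list)
    for e in events:                                   # events in time order
        if len(by_user[e["user"]]) < window:
            by_user[e["user"]].append(bool(e["accepted"]))
    reverted = eligible = 0
    for accepts in by_user.values():
        if len(accepts) < window or True not in accepts:
            continue                                   # too few interactions, or never engaged
        eligible += 1
        last_accept = max(i for i, a in enumerate(accepts) if a)
        if last_accept < window // 2:
            reverted += 1                              # accepted early, then ignored everything after
    return reverted / eligible if eligible else 0.0
```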
The Failure Mode the Slide Deck Cannot See
The reason the 70% uncanny valley keeps catching teams off guard is that it is invisible from the perspective the org chart privileges. The accuracy number is up. The eval suite is passing. The model card looks better than last quarter's. The dashboard slide for the leadership review is green.
Meanwhile the user is forming a quiet, durable belief that the feature is unreliable in a way that they can no longer predict, and unreliability-you-can-predict is recoverable while unreliability-you-cannot is not. The user who learns "this AI is wrong about long Spanish queries" still uses it for short English ones; the user who learns "this AI is sometimes wrong about things I cannot identify in advance" stops using it for anything important. The first user is a customer for the next release. The second is churn the team will not see for two quarters and will misattribute when they do.
The actionable takeaway is that reliability is not a single number on a slide — it is the user's ability to predict when to trust the output. The teams who internalize this stop measuring features with one accuracy number and start measuring them with two: the accuracy itself and the user's demonstrated ability to predict when the output will be right. The teams who do not internalize it ship feature after feature into the uncanny zone, watch the dashboard go green, and wonder why the engagement curve flattens around the third week.
You cannot win the trust war by getting the model another five points more accurate. You can only win it by designing the product so that "I cannot tell when to trust this" is never the question the user has to answer.
Sources
- https://arxiv.org/html/2402.07632v4
- https://www.visible-language.org/journal/issue-59-2-addressing-uncertainty-in-llm-outputs-for-trust-calibration-through-visualization-and-user-interface-design/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC8181412/
- https://www.tandfonline.com/doi/full/10.1080/10447318.2023.2301250
- https://link.springer.com/article/10.1007/s00146-025-02422-7
- https://www.aiuxdesign.guide/patterns/trust-calibration
- https://arxiv.org/html/2502.13321
- https://survey.stackoverflow.co/2025/ai/
- https://kpmg.com/xx/en/media/press-releases/2025/04/trust-of-ai-remains-a-critical-challenge.html
- https://fly.io/blog/trust-calibration-for-ai-software-builders/
- https://developers.google.com/machine-learning/crash-course/fairness/evaluating-for-bias
- https://arize.com/blog/evaluating-model-fairness/
- https://ilwllc.com/2025/12/balancing-ai-autonomy-human-oversight-with-adaptive-human-in-the-loop/
