The Success Metric That Improved Because the Model Declined the Hard Cases

June 1, 2026 · 9 min read

Software Engineer

You bumped the model on Tuesday. By Friday, the "task completion rate" dashboard had climbed from 71% to 78%. Leadership noticed. Someone screenshotted it for the all-hands. Two weeks later, support quietly flagged that churn on a specific cohort of complex tickets had doubled. Nobody connected the two events because, on paper, the agent got better. In reality, the new model just got better at refusing.

This is the metric decoupling problem, and it is one of the most expensive ways an LLM-powered product can deceive its own builders. Your success rate did not measure what you thought it measured. It measured how often the model was right among the cases it chose to attempt. When a model upgrade, a prompt change, or a safety-tuning pass shifts the boundary of "attempted," your numerator and your denominator move together, and the ratio can go up even as user-perceived quality falls off a cliff.

The Refusal Calibration Nobody Told You About

Every frontier model release ships with a recalibrated refusal surface. Anthropic's own Claude Opus 4.5 system card noted a "minor uptick" in refusal rates relative to 4.1, consistent across languages tested. That's a sentence written for transparency, and it sounds harmless. It is not harmless if your product's success metric does not distinguish "the model answered correctly" from "the model declined to answer at all."

The shift is rarely advertised in capability benchmarks. Benchmarks measure performance on cases the model attempts. The cases the model stops attempting after a safety pass slide silently out of the numerator and the denominator alike. A targeted RLHF update for one failure category does not behave like a surgical edit — it shifts behavior across a distribution. The model learns "this kind of question is risky" and generalizes that signal to neighboring questions whose risk profile is, in your product context, much lower.

So when you swap from version N to version N+1, three things happen at once:

Easy cases: attempt rate roughly constant, success rate roughly constant or slightly up.
Borderline cases: attempt rate drops, and the cases that were dropped were disproportionately the ones the old model used to attempt-and-fail.
Hard cases your users actually pay for: attempt rate drops the most, replaced by a polite "I can't help with that" that your instrumentation tags as no_error.

Aggregate over all three and the success rate ticks up. The graph is technically true. The product is worse.
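To make the aggregation concrete, here is a minimal sketch with hypothetical counts, chosen so that the attempted-only rate reproduces the 71% to 78% move from the opening anecdote. The class names, the `ClassStats` type, and every number are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    name: str
    requests: int   # all requests in this class
    attempted: int  # requests the model engaged with (did not decline)
    succeeded: int  # attempted requests that ended in a correct outcome


def attempted_only_rate(stats):
    """Success over attempted turns: declined turns drop out of the denominator."""
    return sum(s.succeeded for s in stats) / sum(s.attempted for s in stats)


def end_to_end_rate(stats):
    """Success over all requests: a declined turn counts as a non-success."""
    return sum(s.succeeded for s in stats) / sum(s.requests for s in stats)


# Hypothetical traffic for version N and version N+1 (same 1000 requests).
old = [
    ClassStats("easy", 600, 590, 530),
    ClassStats("borderline", 250, 230, 130),
    ClassStats("hard", 150, 140, 22),
]
new = [
    ClassStats("easy", 600, 590, 535),
    ClassStats("borderline", 250, 170, 110),
    ClassStats("hard", 150, 80, 10),
]

for label, stats in (("old", old), ("new", new)):
    print(label, round(attempted_only_rate(stats), 3), round(end_to_end_rate(stats), 3))
```

With these made-up counts the attempted-only rate rises while the rate over all requests falls, which is the pattern the dashboard hides.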

Why "Success Rate" Is the Wrong Single Number

The literature on selective prediction has been clear about this for well over a decade. A classifier that can abstain has two performance axes, not one: coverage (the fraction of inputs predicted on) and risk (the error rate on the inputs it did predict on). The complete performance profile is a curve, not a number. Reporting accuracy without reporting coverage is reporting half a measurement.
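A risk-coverage curve can be sketched in a few lines. The version below assumes each prediction carries a confidence score and a correctness label, and that the system answers only its k most confident inputs. The function name and the sample data are hypothetical.

```python
def risk_coverage_curve(scored):
    """scored: list of (confidence, is_correct) pairs.

    Returns a list of (coverage, risk) points, one per k = 1..n, where the
    system answers only its k most confident inputs and abstains on the rest.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    curve = []
    errors = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        curve.append((k / n, errors / k))  # coverage, error rate on answered inputs
    return curve


# Hypothetical scored predictions.
sample = [(0.95, True), (0.90, True), (0.80, True),
          (0.60, False), (0.50, True), (0.30, False)]
for coverage, risk in risk_coverage_curve(sample):
    print(f"coverage={coverage:.2f} risk={risk:.2f}")
```

Reading one point off this curve without the other coordinate is the single-number success rate this section argues against.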

LLM products inherited this property the moment refusals became part of the response distribution. A single-number "success rate" on a stochastic system with a variable refusal boundary is not a metric — it is a story the underlying behavior can rewrite without the dashboard noticing. The number can climb because the model got smarter. It can climb because the model got more cautious. It can climb because users self-selected toward easier requests after watching the model decline harder ones twice. From the dashboard alone, you cannot tell which.

The customer-support world has already learned this lesson in expensive form. "Deflection rate" — tickets the AI handled without escalation — looks like a savings metric until you audit the deflected tickets. Industry audits of RAG-based support deployments find 15–25% of "deflected" tickets were closed with incorrect or incomplete answers. The customer left. The problem did not. The CSAT graph, lagged by thirty days, tells the real story; the deflection graph, in real time, lies cheerfully.

Resolution rate — confirmed problem solved — is the metric that actually correlates with retention. Deflection rate is the metric that correlates with whatever the model felt like doing that week.

The Metric Decomposition That Catches It

The fix is not subtle, but it requires re-instrumentation. Treat success as the product of two independently graphed quantities:

Attempt rate per task class: the fraction of requests the agent engaged with, as opposed to declining, deferring, or escalating.
Success-given-attempt: the fraction of attempted requests that produced a correct, complete outcome.

Headline success rate becomes the product of those two, and any movement in the top line is now explicable by which factor moved. A model upgrade that ships with a refusal recalibration will move attempt rate down and success-given-attempt up. The product can still rise. That movement should now generate an alert, not a celebration.
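As a rough sketch of that decomposition, assume each logged turn records a task class, whether the agent attempted it, and whether it succeeded. The field names `task_class`, `attempted`, and `succeeded` are hypothetical, as is the `decompose` helper.

```python
import math
from collections import defaultdict


def decompose(turns):
    """Return {task_class: (attempt_rate, success_given_attempt, headline)}.

    turns: iterable of dicts with keys 'task_class' (str),
    'attempted' (bool), and 'succeeded' (bool).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # total, attempted, succeeded
    for turn in turns:
        c = counts[turn["task_class"]]
        c[0] += 1
        if turn["attempted"]:
            c[1] += 1
            if turn["succeeded"]:
                c[2] += 1

    result = {}
    for cls, (total, attempted, succeeded) in counts.items():
        attempt_rate = attempted / total
        # Undefined when nothing was attempted; NaN keeps that visible on a graph.
        sga = succeeded / attempted if attempted else math.nan
        headline = succeeded / total  # equals attempt_rate * sga when sga is defined
        result[cls] = (attempt_rate, sga, headline)
    return result


# Hypothetical log: four legal-review turns, two attempted, one successful.
log = [
    {"task_class": "legal_review", "attempted": True, "succeeded": True},
    {"task_class": "legal_review", "attempted": True, "succeeded": False},
    {"task_class": "legal_review", "attempted": False, "succeeded": False},
    {"task_class": "legal_review", "attempted": False, "succeeded": False},
]
print(decompose(log))
```

Graphing the first two values per class, rather than only the third, is what makes a top-line move attributable to one factor or the other.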

Per-task-class slicing matters here. A global attempt rate hides the usual problem with averages: a 2-point drop across the whole distribution could be uniform, or it could be a 20-point drop concentrated in a highest-revenue cohort that makes up a tenth of traffic. Slice by task type, customer segment, and request complexity tier. The decoupling of attempt rate from success-given-attempt is itself class-conditional: borderline-safety classes are the first to move, technical-difficulty classes drift over multiple releases, and the long tail of unusual requests moves invisibly because nobody is watching that bucket.
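The averaging problem is easy to verify with arithmetic. The sketch below assumes a hypothetical two-class split in which the high-revenue cohort is 10% of traffic. All shares and rates are made up for illustration.

```python
def global_attempt_rate(classes):
    """classes: {name: (traffic_share, attempt_rate)}, with shares summing to 1."""
    return sum(share * rate for share, rate in classes.values())


# Hypothetical baseline and two candidate releases.
baseline = {"routine": (0.9, 0.95), "enterprise": (0.1, 0.90)}
uniform = {"routine": (0.9, 0.93), "enterprise": (0.1, 0.88)}       # 2 points everywhere
concentrated = {"routine": (0.9, 0.95), "enterprise": (0.1, 0.70)}  # 20 points in one cohort

for label, classes in (("baseline", baseline), ("uniform", uniform),
                       ("concentrated", concentrated)):
    print(label, round(global_attempt_rate(classes), 3))
```

Both candidate releases show the same 2-point drop in the global number, and only the per-class view tells them apart.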

A regression test should assert: attempt rate cannot drop by more than X% on any tracked class without a deliberate calibration change being recorded. The assertion is structural — it does not require predicting which classes will move; it requires the team to acknowledge any movement before it ships.
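One possible shape for that assertion is sketched below. It reads "X%" as an absolute drop in attempt rate, expressed as a fraction, which is an assumption. The function name, the `acknowledged` set standing in for a recorded calibration change, and the example rates are likewise hypothetical.

```python
def attempt_rate_violations(baseline, candidate, max_drop, acknowledged=frozenset()):
    """List tracked classes whose attempt rate fell by more than max_drop.

    baseline, candidate: {task_class: attempt_rate} for the current and the
    proposed release. max_drop is an absolute fraction (0.05 means 5 points).
    acknowledged: classes with a recorded, deliberate calibration change.
    A tracked class missing from candidate is reported with a rate of None.
    """
    violations = []
    for cls, base_rate in baseline.items():
        if cls not in candidate:
            violations.append((cls, base_rate, None))
            continue
        drop = base_rate - candidate[cls]
        if drop > max_drop and cls not in acknowledged:
            violations.append((cls, base_rate, candidate[cls]))
    return violations


# Hypothetical per-class attempt rates before and after an upgrade.
before = {"faq": 0.98, "billing": 0.95, "legal_review": 0.90}
after = {"faq": 0.98, "billing": 0.94, "legal_review": 0.70}

print(attempt_rate_violations(before, after, max_drop=0.05))
```

In a CI gate, a non-empty result would fail the release until someone either fixes the regression or adds the class to `acknowledged` with a written reason.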

The User-Impact Layer Your Eval Suite Doesn't Have

Even the pairing of attempt rate and success-given-attempt sits upstream of what actually matters. Both metrics are scored from the agent's point of view: did I attempt, did I succeed. The user's question is different: did I get what I needed.

A user whose request was politely declined did not get what they needed. Your agent's instrumentation marked the turn as no_error because nothing crashed and no policy was violated. Your eval suite never flagged the case because the response was internally consistent. The user's experience is captured nowhere on your dashboards because user-impact is a layer your stack does not natively emit.
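A first step is for the instrumentation to stop folding declines into no_error. The sketch below tags each turn with a three-way outcome. The phrase pattern is a deliberately crude placeholder and an assumption. A production system would more plausibly use a trained classifier or a judge model, and the `TurnOutcome` names are hypothetical.

```python
import re
from enum import Enum


class TurnOutcome(Enum):
    ERROR = "error"
    DECLINED = "declined"    # nothing crashed, but the agent did not engage
    ATTEMPTED = "attempted"


# Placeholder heuristic: a handful of common refusal phrasings.
_REFUSAL_RE = re.compile(
    r"\b(?:i\s+can(?:not|['’]t)|i(?:\s+am|['’]m)\s+(?:unable|not\s+able)\s+to)"
    r"\s+(?:help|assist|do\s+that)\b",
    re.IGNORECASE,
)


def classify_turn(response_text, raised_error=False):
    """Tag a turn so that polite declines are counted separately from successes."""
    if raised_error:
        return TurnOutcome.ERROR
    if _REFUSAL_RE.search(response_text):
        return TurnOutcome.DECLINED
    return TurnOutcome.ATTEMPTED


print(classify_turn("I can't help with that."))
print(classify_turn("Here is the revised contract clause."))
```

A phrase list will miss soft refusals and partial answers, so its output is best treated as a lower bound on the decline rate.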

The pattern that closes the gap is a user-outcome metric that lives outside the agent's self-report. Plausible implementations:

A follow-up signal: did the user re-ask within N minutes, switch channels to a human, or abandon the session within the next K turns?
A satisfaction probe: a single-question post-turn survey scored independently of whether the agent attempted the task.
A reconciled audit: a sampled re-judgment of declined turns to classify "correctly declined" vs "should have attempted." This is the abstention-rate analogue of false-positive auditing, and it is the only way to distinguish appropriate refusal from over-refusal.
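The follow-up signal is the cheapest of these to prototype. The sketch below assumes session events are already labeled upstream with kinds such as `user_reask`, `human_handoff`, and `abandon`. Those labels, the `Event` type, and the 10-minute default are hypothetical, and a single time window stands in for both the N-minute and the K-turn thresholds.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float   # seconds since session start
    kind: str  # hypothetical labels: "user_reask", "human_handoff", "abandon", "other"


UNMET_KINDS = {"user_reask", "human_handoff", "abandon"}


def unmet_need(turn_t, events, window_s=600.0):
    """True if an unmet-need event follows the agent turn at turn_t within window_s."""
    return any(
        e.kind in UNMET_KINDS and turn_t < e.t <= turn_t + window_s
        for e in events
    )


# Hypothetical session: the agent answered at t=100s, the user went to a human at t=200s.
session = [Event(30.0, "other"), Event(200.0, "human_handoff")]
print(unmet_need(100.0, session))
```

Because this flag is computed from what the user did next, it stays independent of whether the agent believed it attempted or succeeded.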

The reconciled audit is the most expensive of the three and the most informative. It is the equivalent of selective-classification's risk-coverage curve translated into product terms: at this refusal threshold, what is the error rate of the refusals themselves? Most teams never measure this and could not tell you whether their agent over-refuses by 5% or by 50%. The number exists. Nobody has counted it.
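Counting it does not need a large sample. The sketch below assumes reviewers re-judge a random sample of declined turns and label each one as "should have attempted" or not. It reports the over-refusal rate with a Wilson score interval. The function name and the 30-of-100 example are hypothetical.

```python
import math


def over_refusal_estimate(should_have_attempted, sampled, z=1.96):
    """Estimate the over-refusal rate from a sampled audit of declined turns.

    Returns (point_estimate, lower, upper) using a Wilson score interval.
    The default z=1.96 corresponds to roughly 95% confidence.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= should_have_attempted <= sampled:
        raise ValueError("should_have_attempted must be between 0 and sampled")
    p = should_have_attempted / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 100 declined turns re-judged, 30 should have been attempted.
estimate, lower, upper = over_refusal_estimate(30, 100)
print(f"over-refusal: {estimate:.0%} (interval {lower:.1%} to {upper:.1%})")
```

Even a wide interval from a small audit is enough to separate "over-refuses by about 5%" from "over-refuses by about 50%".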

Why Leadership Reads the Wrong Graph

The disconnect between the success-rate dashboard and the churn dashboard is rarely a measurement bug. It is an organizational gap. The success-rate graph lives in the agent team's monitoring stack. The churn graph lives in the customer success team's spreadsheet, lagged by weeks. They are not on the same screen. They are not reviewed at the same meeting. They do not name the same model version as their independent variable.

Leadership reads the success rate. It is real-time, it has a number, and the number is going up. The implicit conclusion is that the last change worked, so ship more like it. The team that runs eval ships another prompt change or model upgrade. The number goes up again. The graph encourages an escalation toward whichever direction maximizes the metric — and the direction that maximizes the metric, in the presence of refusal calibration drift, is the direction of more refusal.

This is the agent-system version of the McNamara fallacy: optimize what you can measure, then assume what you cannot measure is not important. The unmeasured thing here is not a soft variable like "vibes" — it is a hard variable like "did the user get what they needed," obscured because the model was given the right to decline and the dashboard was not given the right to count declines as failures.

A management discipline that survives this requires three things on the same screen as headline success rate: attempt rate per class, success-given-attempt per class, and a user-outcome signal. Anything less, and the leadership conclusion drawn from the dashboard will be wrong in a direction that compounds over releases.

What This Costs Per Release

Every model upgrade and every prompt change is a refusal-boundary perturbation whether the team intended it or not. The cost of not measuring it is not a single missed regression — it is a slow drift in the agent's helpfulness over a year of "improvements," each of which made the success-rate dashboard look better and the user experience worse.

The teams that do measure it inherit a more honest engineering process. They can run a model upgrade as a deliberate trade: this release improves success-given-attempt by 3 points and costs us 4 points of attempt rate on the legal-review class — do we want it? That conversation has the right shape. The conversation that ends in "task completion rate went up, ship it" has the wrong shape and will eventually be paid for by a churn cohort nobody can re-acquire.
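Whether that trade is worth taking depends on the baseline, which the quoted example does not state. The sketch below plugs in hypothetical starting values of 80% attempt rate and 70% success-given-attempt for the legal-review class to show how the net effect on that class's headline rate could be computed.

```python
def headline(attempt_rate, success_given_attempt):
    """Per-class headline success rate as the product of its two factors."""
    return attempt_rate * success_given_attempt


# Hypothetical baseline for the legal-review class.
old_attempt, old_sga = 0.80, 0.70
# The trade from the text: +3 points success-given-attempt, -4 points attempt rate.
new_attempt, new_sga = old_attempt - 0.04, old_sga + 0.03

delta = headline(new_attempt, new_sga) - headline(old_attempt, old_sga)
print(f"headline change: {delta * 100:+.2f} points")
```

With these assumed baselines the class loses about half a point of headline success, so the release is a net cost on that class even though success-given-attempt improved.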

The architectural realization is simple and unpleasant: the moment your model can refuse, your success rate is no longer measuring what you thought. It is measuring the joint distribution of capability and willingness, and willingness is a moving target that changes with every release the provider ships. Treat it as a moving target. Decompose the metric. Audit the refusals. Put the user-outcome signal next to the agent-self-report signal on the dashboard the team reviews on Monday.

The graph that goes up because the model declined harder cases is the most dangerous kind of green. It looks like progress. It is the shape of a product quietly losing the customers it most wants to keep.

About Tian Pan

The Refusal Calibration Nobody Told You About​

Why "Success Rate" Is the Wrong Single Number​

The Metric Decomposition That Catches It​

The User-Impact Layer Your Eval Suite Doesn't Have​

Why Leadership Reads the Wrong Graph​

What This Costs Per Release​

Recommended Reading

About Tian Pan

The Refusal Calibration Nobody Told You About

Why "Success Rate" Is the Wrong Single Number

The Metric Decomposition That Catches It

The User-Impact Layer Your Eval Suite Doesn't Have

Why Leadership Reads the Wrong Graph

What This Costs Per Release