Cost-Per-Correctness, Not Cost-Per-Token: The Unit Metric Your Bill Won't Tell You
A team I know cut their inference bill 40% by migrating their support-email triage flow from a frontier model to a mid-tier one. The CFO sent a thank-you note. Six months later, customer support headcount was up two FTEs and average resolution time had risen 35%. Nobody connected the dots, because the dots lived in different dashboards: the inference bill on the platform team's, the support load on the operations team's. The migration looked like a win on the only metric anyone was tracking. The metric was wrong.
This is the cost-per-token trap. Your invoice tells you what you spent on tokens. It cannot tell you what you spent per correct task, because the inference vendor has no idea what "correct" means in your domain. They sold you raw compute. You bought outcomes — or thought you did. The gap between those two units is where AI unit economics quietly comes apart, and the team that doesn't measure the right denominator is running half the equation and shipping the other half blind.
The Numerator Lies When the Denominator Is Missing
Cost-per-token is a clean number. The provider gives it to you on the bill, you can sum it across requests, and you can plot it month over month. It looks like a unit cost. It isn't. It's a gross cost — denominated in raw inference attempts, not in completed work.
The right unit for an AI feature is dollars-per-completed-correct-task. Once you write that fraction down, two dynamics show up that the token bill never surfaces:
- A cheaper model that retries three times to land an acceptable answer is not actually cheaper. Three invoiced calls instead of one, plus three round-trips of latency, plus three chances for the user to abandon the flow.
- An expensive model that completes the task in one pass is not actually expensive. One invoice. One round-trip. The "premium" is paid in tokens but recovered in throughput, in retry budget, and in the support tickets that don't get filed.
Industry analyses have started catching up. Recent breakdowns of LLM cost in production point out that the real cost formula is multiplicative — base tokens times cache multiplier times batch multiplier times reasoning multiplier times retry multiplier — and that silent retries alone can multiply token usage 2–5x without any change in user traffic. The team that only watches the headline rate sees a clean line. The team that watches per-task cost sees a feature that's quietly tripling its draw.
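To see how those multipliers compound into a unit cost, here is a minimal sketch in Python. Every multiplier name and number is illustrative, not pulled from any vendor's billing schema; the point is the shape of the arithmetic.

```python
# Illustrative sketch of the multiplicative cost model described above.
# Every multiplier name and number here is an assumption, not a vendor figure.

def cost_per_correct_task(
    base_tokens: float,           # tokens for one clean attempt at the task
    price_per_token: float,       # blended input/output price from the invoice
    cache_multiplier: float,      # <1.0 when prompt caching reduces billed tokens
    batch_multiplier: float,      # discount or penalty from the batching strategy
    reasoning_multiplier: float,  # extra tokens burned by reasoning traces
    retry_multiplier: float,      # average attempts per task; silent retries live here
    correctness_rate: float,      # fraction of tasks your eval grades as correct
) -> float:
    cost_per_task = (
        base_tokens * price_per_token
        * cache_multiplier * batch_multiplier
        * reasoning_multiplier * retry_multiplier
    )
    # Failed tasks still burn tokens, but only correct ones count in the denominator.
    return cost_per_task / correctness_rate

# A "cheap" model that silently retries 3x and lands 80% of tasks:
print(cost_per_correct_task(2_000, 1e-6, 0.9, 1.0, 1.2, 3.0, 0.80))  # ~0.0081
# A model at 3x the sticker price that needs one pass and lands 95%:
print(cost_per_correct_task(2_000, 3e-6, 0.9, 1.0, 1.0, 1.0, 0.95))  # ~0.0057
```

On the headline rate the second model is three times as expensive; on the unit that matters it comes out ahead.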
Why Vendors Can't Give You This Number
The provider knows what tokens they served. They do not know which of those tokens produced a correct answer in your domain. They cannot. Correctness is defined by your eval set — the gold dataset, the rubric, the human-or-LLM judge that says "this triage routed the email to the right queue" or "this code review caught the bug it was supposed to catch." That artifact lives in your repository, not theirs.
This is why cost-per-correctness is fundamentally a metric you have to build. There is no SaaS dashboard that hands it to you. The shape of the work is:
- Per-request inference spend, joined by request ID to
- Per-request graded outcome, where the grade comes from your eval pipeline (human review, LLM-as-judge calibrated against humans, deterministic check, or a hybrid)
- Aggregated by feature, so you can roll up "this AI feature spent $X to produce Y correct tasks" into a unit cost
The plumbing is unglamorous. It's a join across two streams that historically belonged to two different teams: the platform team owns the inference logs, the eval team owns the gradings. The team that wants the right number has to make those two streams meet. The team that doesn't will keep optimizing whatever number the invoice provides — which is the wrong number.
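A minimal sketch of that join, assuming both streams can be exported with a shared request ID (the file and column names here are placeholders for whatever your logging actually emits):

```python
# Join per-request inference spend to per-request eval grades, then roll up
# by feature. File and column names are hypothetical placeholders.
import pandas as pd

spend = pd.read_csv("inference_log.csv")   # request_id, feature, usd_cost
grades = pd.read_csv("eval_grades.csv")    # request_id, correct (bool)

ledger = spend.merge(grades, on="request_id", how="left")
ledger["correct"] = ledger["correct"].fillna(False)  # ungraded requests earn no credit

per_feature = ledger.groupby("feature").agg(
    total_cost=("usd_cost", "sum"),
    correct_tasks=("correct", "sum"),
)
per_feature["cost_per_correct"] = (
    per_feature["total_cost"] / per_feature["correct_tasks"].clip(lower=1)
)
print(per_feature.sort_values("cost_per_correct", ascending=False))
```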
The Migration That Looked Like a Win
The "migrate to a cheaper tier and watch the bill drop" pattern is now the most common AI cost-cutting motion in industry, and it's also the one most likely to silently regress quality if cost-per-correctness isn't being tracked. The pathology:
- Q1: A feature ships on a top-tier model. Token spend is high. Finance flags it.
- Q2: Platform team migrates the feature to a cheaper tier. Visible inference bill drops 40%. Engineering writes a celebratory post.
- Q3: Support volume on AI-routed paths starts trending up. Tickets are vague — "the bot didn't help me," "it gave me the wrong answer." No single root cause.
- Q4: Support headcount expands. CSAT dips on the AI-served cohort. Retention on that segment softens.
- Q5: A separate analysis (often by a new hire who hasn't internalized the legacy story) connects the support spike to the migration. The savings are revealed to be net negative once support, churn, and brand cost are priced in.
The structural reason this keeps happening: the inference savings are visible, denominated in dollars, and land on a single team's dashboard. The quality regression is distributed, denominated in support hours and churn, and lands on three other teams' dashboards. The org has no instrument to tie them together — until someone builds cost-per-correctness, and the migration retrospectively shows up as a regression on that line.
The reframing that has to land: moving a feature down a tier is not a cost decision. It is a quality-cost trade-off, and it has to be priced as one at design time. Sometimes the cheaper tier really is fine, the eval confirms it, and the savings are real. Sometimes the cheaper tier degrades correctness by 8 percentage points, the support team eats it, and the savings are illusory. The eval is the only thing that can tell you which case you're in before the experiment plays out in production.
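Pricing the trade at design time is one division, shown below with invented numbers: a tier that is 40% cheaper on the invoice, an eight-point correctness drop, and a blended support cost for each miss.

```python
# Back-of-envelope pricing of a tier migration. The per-task costs, correctness
# rates, and support cost below are assumptions, not measurements.
premium_cost, premium_correct = 0.010, 0.94   # $/task and eval pass rate
cheap_cost, cheap_correct = 0.006, 0.86       # the 40%-cheaper tier
support_cost_per_miss = 0.50                  # blended human cost of a wrong answer

def net_cost_per_correct(task_cost, correct_rate, miss_cost):
    # Inference spend plus downstream support spend, per task that actually lands.
    expected_cost = task_cost + (1 - correct_rate) * miss_cost
    return expected_cost / correct_rate

print(net_cost_per_correct(premium_cost, premium_correct, support_cost_per_miss))  # ~0.043
print(net_cost_per_correct(cheap_cost, cheap_correct, support_cost_per_miss))      # ~0.088
```

With these numbers the migration reads as a 40% cut on the invoice and roughly doubles the unit cost once the misses are priced in. Shrink the correctness delta to a point or two and the same arithmetic says the cheaper tier really is fine, which is exactly why the eval has to run first.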
Building the Per-Task Cost Ledger
The discipline that has to land, concretely:
A per-task cost ledger. Every inference call gets tagged with a task ID — the unit of work the user actually cares about. A "task" is whatever your product defines: an email triaged, a code review completed, a customer query resolved. One task may span multiple model calls (retrieval, generation, validation, retry). The ledger sums spend across all calls in the task and joins to the eval-graded outcome for that task. The unit cost rolls up against successful tasks, not gross requests. (A minimal sketch of the rollup appears below.)
A routing decision frame. When the platform team proposes a tier migration, the proposal includes the eval delta, not just the cost delta. A migration that saves $0.003/task but drops correctness from 94% to 86% is documented as a quality decision, not a cost win. The trade-off is made explicit at design time and re-reviewed when the feature's cost-per-correctness shifts.
A finance partnership model. The FP&A team sees cost-per-correctness as a first-class line item alongside cost-per-token. The former is what shows up in customer churn, support tickets, and renewal rates. The latter is what shows up on the invoice. A finance team that only sees the invoice will keep approving migrations that look like savings while producing hidden costs it cannot see in its own data.
A quarterly review where rising cost-per-correctness triggers eval refresh, not cheaper-model migration. If the unit cost is climbing, the question is not "can we move to a smaller model?" The question is "is the eval still measuring the right thing? Has the task distribution shifted? Is a prompt change overdue?" Cheaper-model migration is the last lever, not the first.
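A minimal sketch of the ledger rollup referenced above; the task IDs, call records, and grades are illustrative stand-ins for your own pipeline's schema.

```python
# Per-task cost ledger: several model calls roll up into one task, and spend is
# divided by the tasks the eval graded as correct. Records here are illustrative.
from collections import defaultdict

# Every inference call is tagged with the task it belongs to.
calls = [
    {"task_id": "t1", "stage": "retrieval",  "usd": 0.0004},
    {"task_id": "t1", "stage": "generation", "usd": 0.0031},
    {"task_id": "t2", "stage": "generation", "usd": 0.0029},
    {"task_id": "t2", "stage": "retry",      "usd": 0.0033},
]
# Graded outcomes come from the eval pipeline, keyed by the same task_id.
outcomes = {"t1": True, "t2": False}

spend_by_task = defaultdict(float)
for call in calls:
    spend_by_task[call["task_id"]] += call["usd"]

total_spend = sum(spend_by_task.values())
correct_tasks = sum(1 for task in spend_by_task if outcomes.get(task, False))

# Gross-request accounting would divide by four calls; the unit cost divides
# by the one task that actually landed.
print(f"cost per correct task: ${total_spend / max(correct_tasks, 1):.4f}")
```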
This sounds like overhead. It is overhead — the kind of overhead that distinguishes a team running an AI feature as a product from a team running an AI feature as a demo that shipped.
What Cost-Per-Correctness Reveals That Cost-Per-Token Cannot
Once the ledger exists, it lights up trade-offs that are invisible on the token bill:
- Retry economics. A flow that retries on low-confidence outputs may have a higher cost-per-request than a flow that doesn't. But if the retry-enabled flow has a 96% correctness rate against 78% for the no-retry version, the cost-per-correctness is dramatically better. The token bill makes retry look expensive. The unit cost reveals it as the cheap option.
- Cache vs. fresh-call decisions. A semantically cached response that's 90% as good is cheap on tokens but might be measurably worse on correctness for high-stakes paths. The team without cost-per-correctness will keep ratcheting up the cache hit rate as a cost-saving win. The team with it will see the correctness floor and know when to draw the line.
- Premium-tier ROI. The frontier model is more expensive per token. It may be cheaper per correct task if it eliminates retries, lowers human-review burden, and reduces downstream support load. Without the unit cost, the premium tier looks like an indulgence. With it, it can be the most efficient choice for high-stakes tasks.
- Prompt-change attribution. A prompt revision that "looks similar" on token spend can shift correctness 5–10 points either direction. Cost-per-correctness catches the regression on the next ledger run; token spend masks it indefinitely (worked numbers below).
The pattern: every architectural choice in an AI system is a quality-cost trade-off. The token bill prices half the trade. The eval prices the other half. Multiply, divide, and the unit cost is the only number that captures both.
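One concrete instance of that multiply-and-divide, using the prompt-change case from the last bullet with invented numbers:

```python
# Prompt-change attribution on the ledger: spend per task barely moves between
# two prompt versions, but the eval-graded unit cost does. Numbers are invented.
runs = {
    # version: (usd spent on the eval batch, tasks graded correct, total tasks)
    "prompt_v1": (4.10, 188, 200),
    "prompt_v2": (4.05, 172, 200),  # "looks similar" on spend, eight points worse on the eval
}

for version, (usd, correct, total) in runs.items():
    print(f"{version}: ${usd / total:.4f}/task, "
          f"{correct / total:.0%} correct, "
          f"${usd / correct:.4f}/correct task")
```

Per-task spend is flat to the third decimal place; the cost per correct task moves by about eight percent, and that is the line the ledger catches.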
When You Can Skip This (and When You Can't)
Not every AI feature needs a cost-per-correctness ledger on day one. Internal tools used by 30 employees are fine on token spend alone — the support cost of bad output is bounded by who is willing to file a ticket, and that's a small number. Throwaway prototypes with no evaluation don't need a denominator yet because the numerator hasn't earned the engineering investment.
The line where cost-per-correctness becomes load-bearing:
- The feature serves end customers (not just employees). Quality regressions show up as churn, and you can't see churn on the inference bill.
- The volume is high enough that token spend is a board-level number. Once finance is asking the question, the answer needs the right denominator or the next quarter's optimization motion will be wrong.
- There is a cheaper-tier option being seriously considered. This is the highest-leverage moment to have cost-per-correctness in place. The migration decision will get made with or without the metric — better with.
- The task has a quality grade you can produce. If you have an eval at all, you have the raw material. If you don't have an eval, build one before you build the ledger; the ledger without the eval is just a prettier token bill.
The Architectural Realization
AI unit economics is a graded-outcome problem with a cost numerator and an eval denominator. The team measuring only the numerator is not measuring unit economics — it's measuring inference operations, which is a different thing that happens to be denominated in the same currency. The conflation is easy to make and hard to fix once a quarter of decisions has been made on the wrong metric.
The teams that figure this out early end up with a strange superpower: they can let the inference bill grow when it should grow, and shrink it when it should shrink, with confidence that each move is correct on the unit they actually care about. Their cost-per-correctness numbers are stable or declining even as their token spend fluctuates with traffic. Their migrations land cleanly because the eval delta is priced in advance. Their finance team trusts them because the unit cost rolls up against business outcomes, not against a counter that nobody outside the platform team understands.
The teams that don't figure this out spend the next two years optimizing the wrong number, then spend the third year unwinding the optimizations after a churn analysis traces the damage back to the cost-cutting wins. By that point, the cheap migrations have ossified into the architecture, and reversing them costs more than the original "savings" ever produced.
The invoice you get from the inference vendor is not your AI cost. It is one input into your AI cost. The other input is what your eval says happened. Until you join those two streams, the number on the bill is precise, denominated, auditable — and incomplete in the only way that matters.
