
The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that the average conversation length on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.

The vendor SKU is not the unit of work

The unit of work the customer pays for is rarely the unit the model provider bills for. A customer pays for a resolved ticket, an accepted suggestion, a completed booking, a generated brief that ships without rewrite. The provider bills for tokens, or seats, or model calls. These are not the same unit, and the conversion ratio between them is the thing that determines whether the agent has positive unit economics.

The naive calculation that gets put into the business case usually looks like this: average tokens per ticket times price per token equals cost per ticket. In practice that number is low by a factor of three to eight. A realistic support resolution that requires two or three tool calls fires five to eight LLM inferences, each of which carries the accumulated conversation context. By turn seven the input token count on each call has tripled from turn one. A session that runs twice as many turns can easily cost three or four times as much, because every later turn resends a longer context and so costs more than the turns before it. None of that shows up in the cost-per-call dashboard, because cost-per-call is an average over a long-tailed distribution.
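To make the compounding concrete, here is a minimal sketch of a session in which every call resends the accumulated context. The linear growth model and the two constants (`BASE_TOKENS`, `ADDED_PER_CALL`) are assumptions chosen so that the seventh call carries three times the input of the first; real sessions grow less evenly, and output tokens are ignored.

```python
def call_input_tokens(call_index: int, base_tokens: int, added_per_call: int) -> int:
    """Input tokens for one call (0-indexed) when all prior context is resent."""
    return base_tokens + added_per_call * call_index


def session_input_tokens(calls: int, base_tokens: int, added_per_call: int) -> int:
    """Total input tokens billed across every call in one session."""
    return sum(call_input_tokens(i, base_tokens, added_per_call) for i in range(calls))


# Hypothetical numbers, not measurements.
BASE_TOKENS = 1_500     # system prompt plus the first user message
ADDED_PER_CALL = 500    # context appended by each call (replies, tool results)

short_session = session_input_tokens(4, BASE_TOKENS, ADDED_PER_CALL)
long_session = session_input_tokens(8, BASE_TOKENS, ADDED_PER_CALL)

print(short_session)                            # 9000
print(long_session)                             # 26000
print(f"{long_session / short_session:.1f}x")   # 2.9x
```

Under these assumptions, doubling the number of calls nearly triples the input tokens billed. The ratio approaches four as the per-call growth gets large relative to the base prompt.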

The pattern is that the vendor's billing schema is a shape, and the team's optimization target inherits that shape. If the shape is per-token, the team optimizes for shorter outputs. If the shape is per-call, the team optimizes for fewer calls. If the shape is per-seat, the team optimizes for higher seat utilization. None of those targets is necessarily aligned with the unit the customer is paying for, and in many deployments at least one of them is actively misaligned. The shorter output might be the one that leaves the user with a follow-up question. The fewer calls might be the ones that skipped a tool the resolution needed. The higher seat utilization might be the agent handling more tickets per seat at lower per-ticket quality.

What goes in the numerator that nobody puts in

When teams do try to compute cost-per-resolved-task, the second mistake is usually the numerator. The instinct is to count only the tokens that produced the resolution: the successful path, the accepted output, the call that closed the ticket. Everything else (the abandoned conversations, the failed tool calls that triggered retries, the timeouts, the runs that escalated to a human after consuming half a context window) gets bucketed as overhead or quietly ignored.

The correct numerator is total fully loaded spend on that workflow over the period, including every failed attempt, every retry, every abandoned session, every escalation, and every shadow-mode evaluation the workflow triggered. The denominator is accepted outcomes only. A run that consumed forty thousand tokens and ended in an escalation contributes to the numerator and not to the denominator. So does a run that the user abandoned at turn nine. So does a run that an internal eval flagged as low-quality and re-ran. The result is a number that looks, at first, alarmingly high, and that is the point. The first time a team computes cost-per-accepted-outcome honestly, the number is typically three to eight times what the API math suggested. That gap is the cost of every path that did not end in the business getting what it paid for.
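The arithmetic is simple enough to sketch. The `Run` record and the dollar figures below are hypothetical; the only point is that every run's cost goes into the numerator while only accepted runs count in the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    outcome: str      # e.g. "accepted", "escalated", "abandoned"
    cost_usd: float   # fully loaded spend attributed to this run


def cost_per_accepted_outcome(runs: list[Run]) -> float:
    """Total spend on all runs divided by the number of accepted runs."""
    accepted = sum(1 for r in runs if r.outcome == "accepted")
    if accepted == 0:
        raise ValueError("no accepted outcomes in this period")
    return sum(r.cost_usd for r in runs) / accepted


# Hypothetical period: two resolutions, one escalation, one abandoned session.
runs = [
    Run("accepted", 0.25),
    Run("accepted", 0.25),
    Run("escalated", 1.0),
    Run("abandoned", 0.5),
]

# The tempting number: spend on successful runs only.
happy_path = sum(r.cost_usd for r in runs if r.outcome == "accepted") / 2

print(happy_path)                        # 0.25
print(cost_per_accepted_outcome(runs))   # 1.0
```

In this made-up period the honest figure is four times the happy-path figure, and the whole difference comes from the two runs that produced nothing.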

A useful refinement is to break the numerator out by failure mode. Tag each run with one of a small set of outcome states — accepted, rejected, abandoned, timed out, tool error, escalated — and attribute its cost to a bucket. Now you can report a Failure Cost Share alongside cost-per-outcome: the percentage of total workflow spend that produced no business-acceptable result. When Failure Cost Share moves, it tells you which class of failure is driving the unit economics this quarter, and the optimization conversation shifts from "make tokens cheaper" to "make this specific failure mode rarer."
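A sketch of that breakdown, assuming each run has already been tagged with one of the six outcome states and a cost. The tag spellings, the tuple layout, and the figures are illustrative assumptions, not a standard schema.

```python
OUTCOMES = ("accepted", "rejected", "abandoned", "timed_out", "tool_error", "escalated")


def cost_by_outcome(runs: list[tuple[str, float]]) -> dict[str, float]:
    """Sum run cost into one bucket per outcome state."""
    buckets = {outcome: 0.0 for outcome in OUTCOMES}
    for outcome, cost_usd in runs:
        if outcome not in buckets:
            raise ValueError(f"unknown outcome: {outcome}")
        buckets[outcome] += cost_usd
    return buckets


def failure_cost_share(runs: list[tuple[str, float]]) -> float:
    """Fraction of total spend that produced no accepted result."""
    buckets = cost_by_outcome(runs)
    total = sum(buckets.values())
    if total == 0:
        return 0.0
    return (total - buckets["accepted"]) / total


# Hypothetical tagged runs: (outcome, cost in USD).
runs = [
    ("accepted", 0.5),
    ("accepted", 0.5),
    ("escalated", 1.5),
    ("tool_error", 0.5),
    ("abandoned", 0.5),
    ("timed_out", 0.5),
]

buckets = cost_by_outcome(runs)
failures = {k: v for k, v in buckets.items() if k != "accepted"}
worst_failure = max(failures, key=failures.get)

print(failure_cost_share(runs))   # 0.75
print(worst_failure)              # escalated
```

Reporting the largest failure bucket next to the share is what points the next optimization at a specific failure mode instead of at token price.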

The optimization loop you accidentally trained

Once cost-per-call is the metric on the wall, the engineering team's optimization loop adapts to it. Within a quarter, the optimizations that get shipped are the ones that move that number. Shorter prompts, more aggressive caching, smaller models on the orchestrator, fewer tool calls per turn. Each of these is a real engineering choice with real trade-offs, but the trade-offs are not visible in the cost-per-call metric — they show up downstream, in resolution rate, in turn count, in escalation rate, in CSAT. Those are different dashboards, owned by different teams, and they move on different cadences.
