The Model Bill Is Only 30% of Your Inference Cost
A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.
This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system (RAG, agents, anything with tool use or evaluation gates), model inference is often only 30 to 50% of the real per-request cost. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."
If you only audit the model bill, then even cutting it in half shaves barely 15% off the total, while the other 70% of your actual cost structure goes completely unaudited. Worse, some of the moves that lower token spend (a cheaper model, a tighter prompt, aggressive truncation) actively inflate those other buckets by driving up retries, human review volume, or retrieval churn. The saving is a mirage priced in the wrong currency.
What the other 70% actually is
A realistic per-request cost decomposition for a production RAG or agent system looks roughly like this, with wide variation by workload:
- Model inference (input + output tokens): 25–45%
- Context assembly (embedding, vector search, reranking): 10–25%
- Observability and trace retention: 5–15%
- Retry, fallback, and self-correction loops: 5–20%
- Human review and escalation labor: 10–30%
- Guardrails, safety filtering, and eval sampling: 5–15%
Each of these categories has its own failure mode that amplifies cost without showing up in your LLM provider dashboard.
Context assembly is the sleeper line item. Every query pays for an embedding, a vector search, typically a second-stage reranker, and sometimes a dedicated LLM call to rewrite the query. A pipeline that retrieves 50 candidates, reranks to 10, and stuffs 6,000 tokens of context into the prompt is paying embedding fees, vector database reads, a reranker API call or GPU time, and then the context tokens, all before the model generates a single token of answer. Teams routinely underestimate this category by 40 to 60%, because the spend is split across Pinecone or Weaviate bills, Cohere or Voyage rerank fees, and a handful of internal services.
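To make that concrete, here is a minimal per-request sketch. Every rate below is an illustrative placeholder, not any vendor's actual pricing; substitute your own bills.

```python
# Illustrative cost of assembling context for one RAG request.
# Every rate below is a placeholder assumption, not vendor pricing.
EMBED_PER_1K_TOKENS = 0.0001   # query embedding call
VECTOR_READ = 0.0002           # one ANN search against the index
RERANK_PER_DOC = 0.00002       # second-stage reranker, per candidate
INPUT_PER_1K_TOKENS = 0.003    # main model's input-token rate

def context_assembly_cost(query_tokens=50, candidates=50, context_tokens=6000):
    return {
        "embedding":      (query_tokens / 1000) * EMBED_PER_1K_TOKENS,
        "vector_search":  VECTOR_READ,
        "rerank":         candidates * RERANK_PER_DOC,
        # These context tokens DO land on the model invoice, but it was
        # the retrieval pipeline that decided how many there would be.
        "context_tokens": (context_tokens / 1000) * INPUT_PER_1K_TOKENS,
    }

print(context_assembly_cost())
```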
Observability costs scale with trace volume, not request volume. A single agent run with 8 tool calls and 3 reasoning steps can produce 15 to 30 spans. At 5–50 KB per span and millions of requests per day, a team running Langfuse, Arize, or a homegrown ClickHouse setup can easily spend $50K to $100K per month on trace storage and query compute. The retention policy matters: compliance teams want a year, analytics wants six months, debugging wants 30 days. Most teams pick the longest and pay for all of it.
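The arithmetic is worth doing explicitly. A back-of-envelope sketch, with every input an assumption you should replace with your own span sizes and storage rates:

```python
# Back-of-envelope trace retention cost. All numbers are assumptions.
requests_per_day  = 5_000_000
spans_per_request = 20          # tool calls + reasoning steps per agent run
kb_per_span       = 20
retention_days    = 365
cost_per_gb_month = 0.10        # warm storage; query compute is extra

gb_retained = (requests_per_day * spans_per_request * kb_per_span
               * retention_days) / 1_048_576
print(f"{gb_retained:,.0f} GB retained -> "
      f"${gb_retained * cost_per_gb_month:,.0f}/month for storage alone")
# ~696,000 GB -> ~$70K/month, squarely in the range quoted above.
```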
Retry amplification is where agent systems quietly break the math. A 3% tool error rate with exponential backoff and a fallback model sounds benign. At three tool calls per request with independent retry attempts, it compounds to a noticeable fraction of requests paying 2x or 3x model cost. When the fallback is a more expensive reasoning model, a single cascaded retry can cost more than ten successful base-case calls. If you haven't capped retries and budgeted fallbacks explicitly, your cost variance is wider than your mean.
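You can check your own exposure with the same independence assumptions the paragraph above uses; the error rate, call count, and fallback premium here are illustrative:

```python
# Expected cost amplification from retries with a pricier fallback.
# Assumes independent failures per tool call; all numbers illustrative.
p_err = 0.03           # per-tool-call error rate
calls = 3              # tool calls per request
fallback_premium = 10  # fallback model costs 10x a base-case call

p_retry = 1 - (1 - p_err) ** calls           # ~8.7% of requests
multiplier = 1 + p_retry * fallback_premium  # expected cost vs. happy path
print(f"{p_retry:.1%} of requests retry; expected cost: {multiplier:.2f}x")
# A 3% error rate quietly adds ~87% to mean model cost, and far more variance.
```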
Human review is the category finance hates because it's labor, not API spend. But if your system flags 20% of outputs as low-confidence and routes them to reviewers at $30 to $60 per hour loaded cost, with 2 to 5 minutes per review, each escalation costs $1 to $5, which amortizes to roughly $0.20 to $1.00 across all requests and, for many consumer-facing applications, dwarfs the model cost entirely. A working AI FinOps discipline counts this against the AI line item. Most don't, which is why model cost keeps winning optimization budget while the real cost driver is idle reviewers.
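The amortization in miniature, again with illustrative inputs:

```python
# Reviewer labor amortized over all traffic. All inputs illustrative.
escalation_rate = 0.20
loaded_rate_per_min = 45 / 60     # $45/hr loaded cost
minutes_per_review = 3.5

per_review = loaded_rate_per_min * minutes_per_review   # ~$2.60 per review
per_request = escalation_rate * per_review              # ~$0.53 per request
print(f"${per_review:.2f}/review -> ${per_request:.2f} per request overall")
```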
Why "we switched to Haiku and saved 20%" is almost always wrong
The Haiku-switch story is a clean case study because it pulls every other cost lever the wrong way. Smaller models follow instructions less tightly, so the prompt grows to compensate: more few-shot examples, more explicit constraints, longer system prompts. That context inflation multiplies retrieval and reranking cost too, because the cheaper model is worse at ignoring irrelevant chunks, so either retrieval has to get more precise or the prompt has to absorb more candidates.
Smaller models also tend to have worse calibration on their own uncertainty, which means confidence-gated review catches more of their outputs — so the human review bucket grows. Retry and fallback patterns often silently upgrade to a larger model on second attempt, meaning the "savings" only hold for the happy path and reverse on the long tail.
I've seen teams cut their token bill by 30% and lift total cost per resolved task by 8%, because the downstream effects swamped the direct savings. The right comparison is never "tokens per request" or "dollars per million tokens." It's cost per successfully resolved task, amortized across retries, reviews, and the retrieval chain. That number is almost never on any dashboard by default.
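The metric itself is simple to compute once the buckets are instrumented; a minimal sketch, with bucket names and numbers purely illustrative:

```python
def cost_per_resolved_task(buckets: dict[str, float], resolved: int) -> float:
    """Every cost bucket goes in the numerator; only successful outcomes
    go in the denominator, so failed-and-retried work still counts."""
    return sum(buckets.values()) / resolved

day = {"model_tokens": 1200.0, "context_assembly": 640.0,
       "observability": 300.0, "retry_overhead": 410.0,
       "review_labor": 950.0}
print(f"${cost_per_resolved_task(day, resolved=28_000):.3f} per resolved task")
# $0.125, and note the model line is roughly a third of the numerator.
```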
The four levers that actually move the real bottom line
Once you have the full decomposition, the optimization priority reorders itself. The moves that matter, roughly in order of leverage for most RAG and agent workloads:
1. Cache upstream of the model. Prompt caching at the provider level handles repeated system prompts and stable context prefixes; Anthropic's prompt caching and OpenAI's automatic caching both routinely deliver 30 to 90% reductions on the cached portion. Semantic caching one layer up handles duplicate and near-duplicate user queries, and roughly 30% of queries in production workloads show semantic similarity to prior requests. A well-placed semantic cache bypasses the model, the retrieval pipeline, and the eval pass entirely (a minimal sketch follows this list). It is the single highest-leverage optimization available to most teams, and it attacks multiple cost buckets at once.
2. Prune retrieval before the model sees it. Retrieving 50 candidates and letting the model sort them out is the most expensive way to answer a question. A good reranker plus a tight top-k (often 3 to 5, rarely 10+) cuts context tokens by 60 to 80% and usually improves answer quality, because the model isn't distracted by marginally relevant chunks. The reranker's own compute cost is real but small next to the token savings from not stuffing the main prompt. Measure it: your retrieval stage should know its own precision@k on an eval set, and answer quality should be graphed against context size.
3. Gate human review on real confidence, not model confidence. Most teams use the model's self-reported certainty or a heuristic like "did it cite sources" to decide what goes to review. Both are noisy. A small, purpose-trained classifier (logistic regression on top of the retrieval score distribution, answer length, self-consistency across samples, and retrieval overlap) often outperforms the model's own confidence by a wide margin, and it runs for fractions of a cent; see the second sketch after this list. Getting review volume from 20% to 5% at a preserved error rate is routinely worth more than any model swap.
4. Set trace retention by usage tier. Not every span needs 12-month retention. Sample successful production traces at 10% with full retention, keep error traces at 100% with 90-day retention, and archive the rest to cold storage after 30 days. The storage bill grows with every month of retention, and the analytical value of month-old successful traces is usually near zero. A tiered policy can cut observability cost by 60 to 80% without hurting debugging or compliance coverage.
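A semantic cache (lever 1) doesn't need much machinery. Here is a minimal in-memory sketch: `embed_fn` stands in for whatever embedding call your retrieval stage already makes (returning unit-normalized vectors), and the 0.92 threshold is an assumption you would tune against your observed false-hit rate.

```python
import numpy as np

class SemanticCache:
    """Minimal sketch: reuse answers for near-duplicate queries."""

    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn     # must return unit-normalized vectors
        self.threshold = threshold   # tune against observed false hits
        self.vectors: list[np.ndarray] = []
        self.answers: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.vectors:
            return None
        sims = np.stack(self.vectors) @ self.embed_fn(query)  # cosine sims
        best = int(np.argmax(sims))
        # A hit skips the model, retrieval, reranking, and the eval pass.
        return self.answers[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.vectors.append(self.embed_fn(query))
        self.answers.append(answer)
```

At production scale you would back this with the vector database you already run rather than a Python list, but the control flow is the same. The review gate (lever 3) is similarly small. A hedged sketch using scikit-learn; the features are the ones named in the list above, and the training labels come from whatever outcomes your existing review queue already records:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def gate_features(retrieval_scores, answer: str, samples: list[str]):
    """Cheap signals that tend to beat the model's self-reported confidence."""
    s = np.asarray(retrieval_scores, dtype=float)
    agreement = np.mean([x == answer for x in samples])  # self-consistency
    return [s.max(), s.mean(), s.std(), len(answer.split()), agreement]

# X: one gate_features() row per historical request.
# y: 1 where a human reviewer actually found an error, else 0.
gate = LogisticRegression(class_weight="balanced")
# gate.fit(X, y)
# escalate = gate.predict_proba([gate_features(...)])[0, 1] > 0.3  # tune cutoff
```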
What a real FinOps view looks like
The cost report that actually guides decisions has a row per request category — "answered from cache," "RAG resolved," "escalated to review," "failed and retried," "refused" — and a column per cost bucket — model tokens, context assembly, observability, retry overhead, human labor. You can't build this from the provider's billing portal. You have to instrument it at the application layer, with spans tagged by category and cost attribution joined from multiple sources.
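A minimal version of that instrumentation is just a tagged ledger at the application layer; the category and bucket names below mirror the report described above and are otherwise illustrative:

```python
from collections import defaultdict

# request category -> cost bucket -> accumulated dollars
ledger: dict = defaultdict(lambda: defaultdict(float))

def record(category: str, **buckets: float) -> None:
    """Call once per request with whatever buckets it actually incurred.
    Categories: answered_from_cache, rag_resolved, escalated_to_review,
    failed_and_retried, refused."""
    for bucket, dollars in buckets.items():
        ledger[category][bucket] += dollars

record("rag_resolved", model_tokens=0.004, context_assembly=0.003,
       observability=0.0005)
record("escalated_to_review", model_tokens=0.004, context_assembly=0.003,
       observability=0.0005, human_labor=2.60)
```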
The first time a team builds this, two things usually happen. First, someone discovers that 8% of requests are driving 40% of cost, almost always due to retry cascades on a specific failure class. Second, the CFO has questions about why the human review bucket is line-itemed as "AI spend." Both conversations are productive and neither would have happened from staring at the LLM invoice.
The alternative is the Haiku story on repeat: chasing the only number you can see and leaving the actual cost structure unaudited. In a year when per-token prices keep falling by an order of magnitude while total compute spend keeps climbing, that math breaks fast. The teams that win the unit-economics battle in 2026 will be the ones who treat "model cost" as one input into a real decomposition, and optimize the whole stack instead of the one line their vendor happens to print for them.
