The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve
You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.
This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.
What Shadow Compute Actually Looks Like
Shadow compute isn't one thing. It's a category of failure modes that all share the same signature: compute consumed at time T for output that will be discarded, replaced, or never requested.
Proactive generation is the most common. A system predicts that the user will ask a follow-up question and fires off an LLM call to pre-generate the answer. Sometimes this pays off: the user asks, and the response arrives with near-zero latency. More often the user does something else, or asks a slightly different question, and the pre-generated text is thrown away. Unless the acceptance rate is high (roughly 70% or better, depending on what the saved latency is actually worth relative to the generation cost), proactive generation is a net-negative bet.
Background summarization compounds this. Agentic systems that accumulate conversation history will often trigger periodic summarization jobs to compress context before it hits the window limit. This is operationally sensible, but it fires on a timer or token count rather than on demand. In a session where the user is actively reading and has three turns left, the summarization runs anyway — and the summary you just generated will itself be part of the context you summarize next cycle.
Eager context preparation happens when RAG pipelines or tool-calling agents pre-fetch document chunks, run retrieval, or assemble large context windows before the actual query scope is known. You end up with 8,000 tokens of retrieved context, 5,000 of which are structurally irrelevant because the user's intent wasn't what you assumed.
Speculative decoding at the wrong acceptance rate is a subtler case. Speculative decoding is genuinely effective — a small draft model generates candidate tokens that a larger target model verifies in one forward pass, achieving 2–3x throughput on predictable outputs. But when the acceptance rate falls below 50%, the cost of drafting plus verification exceeds the cost of just running the large model directly. Creative writing tasks routinely sit at 0.5–0.65 acceptance rates. Systems that use speculative decoding universally, without per-task calibration, are paying the draft model overhead on every token that gets rejected.
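To see where the crossover sits, here is a toy cost model. It assumes self-hosted serving where cost tracks GPU time, a draft model about five times cheaper per step than the target, and five draft tokens per verification round; all three numbers are assumptions you would replace with your own measurements.

```python
# Illustrative cost model for speculative decoding on self-hosted inference,
# where cost roughly tracks GPU time. The draft-model cost ratio (0.2) and
# draft length (5) are assumptions for the example, not recommendations.

def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per round given per-token acceptance probability
    alpha and gamma draft tokens (standard speculative-sampling result,
    including the one token the target model always contributes)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def relative_cost(alpha: float, gamma: int = 5, draft_step: float = 0.2,
                  target_step: float = 1.0) -> float:
    """Cost per emitted token relative to plain autoregressive decoding,
    assuming one verification pass costs about one autoregressive step
    (the memory-bandwidth-bound regime speculative decoding relies on)."""
    round_cost = gamma * draft_step + target_step
    return (round_cost / expected_accepted(alpha, gamma)) / target_step

for alpha in (0.9, 0.7, 0.5, 0.3):
    print(f"acceptance {alpha:.0%}: {relative_cost(alpha):.2f}x baseline cost")
```

Under these assumptions the break-even lands near 50% acceptance; a cheaper draft model or shorter draft length moves it, which is exactly why per-task calibration matters.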
The Scale of the Problem
Enterprise LLM API spending roughly doubled between late 2024 and mid-2025, to around $8.4B. That trajectory is partly legitimate growth (more features, more users), but a significant fraction of it is waste that compounds because no one is measuring it directly.
The clearest evidence comes from what happens when teams actually audit their inference usage. Routing simple queries to smaller models while reserving expensive frontier models for complex reasoning cuts costs by up to 85% (Berkeley, 2024). Semantic caching eliminates up to 73% of redundant API calls. Anthropic's prompt caching, properly implemented, bills cached input tokens at roughly a tenth of the base price, cutting input costs by up to 90% on long, stable prompts. These numbers aren't marginal. They indicate that for many teams, the majority of inference spend is not proportional to the value being delivered.
The fan-out problem makes this worse in agentic systems. A single user action in a multi-agent workflow can trigger 15–40 LLM calls once you account for strategies, retries, judges, improvement loops, and fallback chains. Most of these calls are legitimate when considered individually. The system is doing what it was designed to do. But the aggregate cost-to-value ratio for any given user outcome is rarely measured, and it's often terrible.
There's also the trajectory accumulation problem specific to long-running agents. Tool call outputs, intermediate results, and prior assistant turns accumulate in the context window and stay there until task completion. By some measurements, as much as 99% of the tokens being processed by production agentic systems are input history tokens — context the model reads every step — while only 1% are newly generated. You're paying to re-read your own chat log on every turn.
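A back-of-envelope sketch makes that ratio concrete. The per-step numbers below (a 2,000-token system prompt, roughly 800 tokens of tool output and 150 generated tokens per step) are illustrative assumptions, but the shape of the result holds across a wide range of values.

```python
# Back-of-envelope: how input (history) tokens come to dominate an agent run.
# The per-step sizes are assumptions for illustration only.

def token_split(steps: int, base_prompt: int = 2000,
                tool_output_per_step: int = 800, generated_per_step: int = 150):
    input_tokens = 0
    history = base_prompt
    for _ in range(steps):
        input_tokens += history          # the model re-reads everything so far
        history += tool_output_per_step + generated_per_step
    output_tokens = steps * generated_per_step
    return input_tokens, output_tokens

inp, out = token_split(steps=30)
print(f"input {inp:,} vs output {out:,} -> {inp / (inp + out):.1%} of tokens are history")
```

Thirty steps of a fairly ordinary agent already puts history tokens above 99% of the total processed, which is why prefix caching and context management dominate the economics of long-running agents.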
Measuring Waste Before It Measures You
The foundational metric is cost per successful user outcome, not cost per token. This sounds obvious, but most teams optimize cost per token because that's what the API invoice shows. Cost per user outcome requires instrumenting the full path from inference call to actual user interaction.
Start with three proxies that don't require deep instrumentation (a sketch for computing all three from tagged call logs follows this list):
Cache hit rate. For systems with prompt caching enabled, a hit rate below 50% indicates structural problems. Cached input tokens cost roughly a tenth as much as uncached tokens on most providers. Best-in-class agentic systems hit 85–87% cache hit rates through careful prompt prefix design. If you're well below that, a large fraction of your spend is on redundant computation.
Speculative work acceptance rate. For every category of background work — proactive generation, pre-fetched context, speculative summaries — track what percentage of that output was actually used. If you can't instrument this directly, proxy it with the abandonment rate: how often does a user session end without consuming any output from a background call that fired during that session?
Fan-out ratio. Count the number of LLM calls per user-visible output. A ratio above 5 warrants investigation. A ratio above 15 almost certainly contains work that could be deferred or eliminated. Tag each call with a reason code and aggregate — you'll usually find that two or three call categories account for the majority of spend, and at least one of them is speculative.
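A minimal sketch of computing all three proxies from a tagged call log might look like the following; the record schema (session_id, category, cached_input_tokens, output_consumed) is an assumption about your logging, not a standard format.

```python
# Compute the three proxies from a per-call log. Field names are assumptions
# about your own logging schema.
from collections import Counter
from dataclasses import dataclass

@dataclass
class CallRecord:
    session_id: str
    category: str             # e.g. "user-requested", "proactive", "background-summary"
    input_tokens: int
    cached_input_tokens: int
    output_consumed: bool     # did any user-visible output depend on this call?

def proxies(calls: list[CallRecord], visible_outputs_per_session: dict[str, int]):
    total_in = sum(c.input_tokens for c in calls)
    cached = sum(c.cached_input_tokens for c in calls)
    speculative = [c for c in calls if c.category != "user-requested"]
    accepted = sum(1 for c in speculative if c.output_consumed)

    calls_per_session = Counter(c.session_id for c in calls)
    fan_out = [
        calls_per_session[s] / max(visible_outputs_per_session.get(s, 1), 1)
        for s in calls_per_session
    ]
    return {
        "cache_hit_rate": cached / total_in if total_in else 0.0,
        "speculative_acceptance": accepted / len(speculative) if speculative else None,
        "median_fan_out": sorted(fan_out)[len(fan_out) // 2] if fan_out else None,
    }
```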
Trigger Design That Doesn't Waste Budget
The goal isn't to eliminate speculative work. Pre-fetching and proactive generation have real value when they're right. The goal is to stop firing them unconditionally.
Probability-gated triggers. Before firing a background inference call, estimate the probability that its output will be used. If you can't estimate this, start with a conservative threshold (60–70%) and collect data. A background summarization job that fires when a session has 80% of context window used is rational. One that fires every 10 turns regardless of remaining capacity is not.
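A minimal gate, assuming you can estimate the usage probability from session signals and put a rough dollar value on the latency saved; both numbers below are placeholders to calibrate from your own logs.

```python
# A probability gate for background inference. The probability estimator and
# the cost/value figures are placeholders, not recommended values.

def should_fire(p_used: float, expected_cost_usd: float,
                value_if_used_usd: float, floor: float = 0.6) -> bool:
    """Fire a background call only if the estimated usage probability clears a
    conservative floor AND the expected value exceeds the expected cost."""
    return p_used >= floor and p_used * value_if_used_usd > expected_cost_usd

# Example: pre-generating a follow-up answer that costs ~$0.02 per attempt and
# saves latency worth ~$0.03 when it actually lands.
print(should_fire(p_used=0.45, expected_cost_usd=0.02, value_if_used_usd=0.03))  # False
print(should_fire(p_used=0.80, expected_cost_usd=0.02, value_if_used_usd=0.03))  # True
```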
Demand-shaped pre-fetching. Rather than pre-generating outputs for likely next states, pre-load the inputs. Fetch documents, assemble context, warm the cache — but don't generate tokens until you know what's being asked. This captures the latency benefit (the expensive I/O is done) without paying inference cost speculatively.
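A sketch of the pattern, with the retrieval and generation calls stubbed out as placeholders for your retrieval layer and model client:

```python
# Demand-shaped pre-fetching: do the cheap, reusable I/O ahead of time and
# defer token generation until the query is known. retrieve_chunks and
# generate_answer are stand-ins, not a real SDK's API.
import asyncio

async def retrieve_chunks(topics):            # placeholder for vector search / doc fetch
    await asyncio.sleep(0.05)                 # simulated I/O latency
    return [f"chunk about {t}" for t in topics]

async def generate_answer(query, context):    # placeholder for the actual LLM call
    return f"answer to {query!r} using {len(context)} chunks"

async def session_flow():
    # Fired when the user is likely to ask something: I/O only, no tokens spent.
    prefetched = await retrieve_chunks(["billing", "caching"])

    # Fired only when the user actually asks: inference cost is incurred here.
    query = "why is my cache hit rate low?"
    context = [c for c in prefetched if "caching" in c]
    print(await generate_answer(query, context))

asyncio.run(session_flow())
```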
Acceptance-rate-aware speculative decoding. Speculative decoding should be task-conditioned, not universal. For deterministic, structure-heavy tasks (JSON generation, code completion, templated outputs), high acceptance rates make it a clear win. For open-ended generation, disable it or use a very conservative draft budget. Both vLLM and TensorRT-LLM expose speculative decoding configuration, so routing different task types to differently configured serving deployments is feasible.
Lazy context compaction. When context is approaching window limits, prefer deletion over summarization. Verbatim deletion of older, less-relevant turns is computationally cheap and avoids the dual cost of generating a summary and then including that summary in every subsequent call. If summarization is genuinely needed for coherence, batch it: don't summarize after every N turns; summarize once when you have enough history to make a materially useful summary.
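A deletion-first compaction sketch; the whitespace token count and the 80% threshold below are crude stand-ins for a real tokenizer and a tuned limit.

```python
# Deletion-first compaction: trim old turns verbatim before resorting to
# summarization. No model call is made, so no tokens are billed.

def compact(messages: list[dict], window_limit: int, keep_recent: int = 6) -> list[dict]:
    def tokens(msgs):
        return sum(len(m["content"].split()) for m in msgs)  # crude proxy for a tokenizer

    if tokens(messages) < 0.8 * window_limit:
        return messages                       # nothing to do, nothing spent

    system, rest = messages[:1], messages[1:]  # assumes the first message is the system prompt
    # Drop the oldest non-recent turns until the history fits again.
    while len(rest) > keep_recent and tokens(system + rest) >= 0.8 * window_limit:
        rest = rest[1:]
    return system + rest
```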
The Billing Signal You're Probably Missing
Shadow compute hides because inference cost attribution is coarse. You see total tokens per model per day. You don't see total tokens per speculative call category per user outcome.
The signal that something is wrong almost always shows up first in cost velocity, not absolute cost. A 40%+ month-over-month increase in inference spend without a corresponding increase in user-visible feature usage is the clearest indicator that background work is proliferating. This is especially common after agentic features ship — the feature adds value, teams get positive feedback, and no one notices that the feature is also firing five background calls per session that weren't in the original design.
Set per-category spend budgets, not just aggregate budgets. Tag every LLM call with a category — proactive, background-summary, speculative-context, user-requested, retry — and alert when any non-user-requested category exceeds a threshold fraction of total spend (10% is a reasonable starting limit). This surfaces the waste explicitly without requiring deep changes to how the system works.
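A minimal tracker along those lines; the categories and the 10% threshold mirror the suggestion above, and the costs fed in are whatever your billing pipeline already computes per call.

```python
# Per-category spend tracking with a simple threshold check for any
# non-user-requested category.
from collections import defaultdict

class SpendTracker:
    def __init__(self, budget_fraction: float = 0.10):
        self.spend = defaultdict(float)
        self.budget_fraction = budget_fraction

    def record(self, category: str, cost_usd: float):
        self.spend[category] += cost_usd

    def over_budget(self):
        total = sum(self.spend.values()) or 1.0
        return [
            (cat, amount / total)
            for cat, amount in self.spend.items()
            if cat != "user-requested" and amount / total > self.budget_fraction
        ]

tracker = SpendTracker()
tracker.record("user-requested", 80.0)
tracker.record("proactive", 14.0)
tracker.record("background-summary", 6.0)
print(tracker.over_budget())   # [('proactive', 0.14)]
```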
The deeper structural fix is to treat inference budget as a resource allocation problem, not just a cost problem. Every background AI task is making a bet: it's spending N tokens now against a probability P of delivering value V. When N × price > P × V, the task is net-negative and should either not fire or fire less frequently. Running this math explicitly for each category of background work is what separates teams that control their inference costs from teams that are surprised by them.
What to Do This Week
If you haven't audited your fan-out ratio, do that first. Pull a sample of 1,000 sessions, count LLM calls per session, and categorize them. This usually takes half a day and almost always reveals at least one category of speculative work that no one knew was firing.
If your cache hit rate is below 70%, fix prompt prefix design before investing in anything else. Stable prefixes that don't change between requests are the single highest-leverage optimization in most systems — they cost nothing to implement and the reduction in compute is immediate.
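A sketch of prefix-stable prompt assembly; the strings are placeholders for your real prompt components, and the point is only the ordering.

```python
# Cache-friendly prompt assembly: everything identical across requests goes
# first, request-specific content goes last, so the provider's prefix cache
# can reuse the long static head.

SYSTEM_PROMPT = "You are a support assistant..."
TOOL_DEFINITIONS = "[tool schemas, identical every request]"
FEW_SHOT_EXAMPLES = "[worked examples, identical every request]"

STATIC_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, FEW_SHOT_EXAMPLES])

def build_prompt(retrieved_context: str, user_message: str) -> str:
    # Dynamic content only after the stable prefix; a timestamp or request ID
    # inserted before this point would invalidate the cached prefix.
    return "\n\n".join([STATIC_PREFIX, retrieved_context, user_message])
```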
If you have background summarization running on a timer, change it to run on demand (user session end, explicit save) or replace it with context window management through selective deletion. The summarization tax — paying to read history you're about to discard — is often the largest single source of shadow compute in conversational systems.
The shadow compute tax is a design choice, not a vendor problem. You pay it when background work fires unconditionally, when speculative outputs are generated without tracking their acceptance rate, and when fan-out is treated as an engineering detail rather than an economic one. The good news: it's entirely recoverable once you can see it.
Sources
- https://www.cloudzero.com/blog/inference-cost/
- https://leanlm.ai/blog/llm-cost-optimization
- https://blog.premai.io/speculative-decoding-2-3x-faster-llm-inference-2026/
- https://www.tensormesh.ai/blog-posts/agent-skills-caching-cacheblend-llm-cache-hit-rates
- https://arxiv.org/html/2509.23586v2
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- https://arxiv.org/pdf/2509.21361
- https://openreview.net/pdf?id=n4V3MSqK77
- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2604.25724
- https://eval.16x.engineer/blog/llm-context-management-guide
