The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve
You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.
This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.
What Shadow Compute Actually Looks Like
Shadow compute isn't one thing. It's a category of failure modes that all share the same signature: compute consumed at time T for output that will be discarded, replaced, or never requested.
Proactive generation is the most common. A system predicts that the user will ask a follow-up question and fires off an LLM call to pre-generate the answer. Sometimes this pays off: the user asks, and the response arrives with near-zero latency. More often the user does something else, or asks a slightly different question, and the pre-generated text is thrown away. Unless the acceptance rate is high (roughly 70% or better, depending on what the saved latency is actually worth relative to the generation cost), proactive generation is a net-negative bet.
Background summarization compounds this. Agentic systems that accumulate conversation history will often trigger periodic summarization jobs to compress context before it hits the window limit. This is operationally sensible, but it fires on a timer or token count rather than on demand. In a session where the user is actively reading and has three turns left, the summarization runs anyway — and the summary you just generated will itself be part of the context you summarize next cycle.
Eager context preparation happens when RAG pipelines or tool-calling agents pre-fetch document chunks, run retrieval, or assemble large context windows before the actual query scope is known. You end up with 8,000 tokens of retrieved context, 5,000 of which are structurally irrelevant because the user's intent wasn't what you assumed.
Speculative decoding at the wrong acceptance rate is a subtler case. Speculative decoding is genuinely effective — a small draft model generates candidate tokens that a larger target model verifies in one forward pass, achieving 2–3x throughput on predictable outputs. But when the acceptance rate falls below 50%, the cost of drafting plus verification exceeds the cost of just running the large model directly. Creative writing tasks routinely sit at 0.5–0.65 acceptance rates. Systems that use speculative decoding universally, without per-task calibration, are paying the draft model overhead on every token that gets rejected.
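To see where the crossover sits, here is a toy cost model. It assumes self-hosted serving where cost tracks GPU time, a draft model about five times cheaper per step than the target, and five draft tokens per verification round; all three numbers are assumptions you would replace with your own measurements.

```python
# Illustrative cost model for speculative decoding on self-hosted inference,
# where cost roughly tracks GPU time. The draft-model cost ratio (0.2) and
# draft length (5) are assumptions for the example, not recommendations.

def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per round given per-token acceptance probability
    alpha and gamma draft tokens (standard speculative-sampling result,
    including the one token the target model always contributes)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def relative_cost(alpha: float, gamma: int = 5, draft_step: float = 0.2,
                  target_step: float = 1.0) -> float:
    """Cost per emitted token relative to plain autoregressive decoding,
    assuming one verification pass costs about one autoregressive step
    (the memory-bandwidth-bound regime speculative decoding relies on)."""
    round_cost = gamma * draft_step + target_step
    return (round_cost / expected_accepted(alpha, gamma)) / target_step

for alpha in (0.9, 0.7, 0.5, 0.3):
    print(f"acceptance {alpha:.0%}: {relative_cost(alpha):.2f}x baseline cost")
```

Under these assumptions the break-even lands near 50% acceptance; a cheaper draft model or shorter draft length moves it, which is exactly why per-task calibration matters.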
The Scale of the Problem
Enterprise LLM API spending roughly doubled between late 2024 and mid-2025, to around $8.4B. That trajectory is partly legitimate growth (more features, more users), but a significant fraction of it is waste that compounds because no one is measuring it directly.
The clearest evidence comes from what happens when teams actually audit their inference usage. Routing simple queries to smaller models while reserving expensive frontier models for complex reasoning cuts costs by up to 85% (Berkeley, 2024). Semantic caching eliminates up to 73% of redundant API calls. Anthropic's prompt caching, properly implemented, bills cached input tokens at roughly a tenth of the base price, cutting input costs by up to 90% on long, stable prompts. These numbers aren't marginal. They indicate that for many teams, the majority of inference spend is not proportional to the value being delivered.
The fan-out problem makes this worse in agentic systems. A single user action in a multi-agent workflow can trigger 15–40 LLM calls once you account for strategies, retries, judges, improvement loops, and fallback chains. Most of these calls are legitimate when considered individually. The system is doing what it was designed to do. But the aggregate cost-to-value ratio for any given user outcome is rarely measured, and it's often terrible.
There's also the trajectory accumulation problem specific to long-running agents. Tool call outputs, intermediate results, and prior assistant turns accumulate in the context window and stay there until task completion. By some measurements, as much as 99% of the tokens being processed by production agentic systems are input history tokens — context the model reads every step — while only 1% are newly generated. You're paying to re-read your own chat log on every turn.
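A back-of-envelope sketch makes that ratio concrete. The per-step numbers below (a 2,000-token system prompt, roughly 800 tokens of tool output and 150 generated tokens per step) are illustrative assumptions, but the shape of the result holds across a wide range of values.

```python
# Back-of-envelope: how input (history) tokens come to dominate an agent run.
# The per-step sizes are assumptions for illustration only.

def token_split(steps: int, base_prompt: int = 2000,
                tool_output_per_step: int = 800, generated_per_step: int = 150):
    input_tokens = 0
    history = base_prompt
    for _ in range(steps):
        input_tokens += history          # the model re-reads everything so far
        history += tool_output_per_step + generated_per_step
    output_tokens = steps * generated_per_step
    return input_tokens, output_tokens

inp, out = token_split(steps=30)
print(f"input {inp:,} vs output {out:,} -> {inp / (inp + out):.1%} of tokens are history")
```

Thirty steps of a fairly ordinary agent already puts history tokens above 99% of the total processed, which is why prefix caching and context management dominate the economics of long-running agents.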
Measuring Waste Before It Measures You
The foundational metric is cost per successful user outcome, not cost per token. This sounds obvious, but most teams optimize cost per token because that's what the API invoice shows. Cost per user outcome requires instrumenting the full path from inference call to actual user interaction.
Start with three proxies that don't require deep instrumentation (a sketch for computing all three from tagged call logs follows this list):
Cache hit rate. For systems with prompt caching enabled, a hit rate below 50% indicates structural problems. Cached input tokens cost roughly a tenth as much as uncached tokens on most providers. Best-in-class agentic systems hit 85–87% cache hit rates through careful prompt prefix design. If you're well below that, a large fraction of your spend is on redundant computation.
Speculative work acceptance rate. For every category of background work — proactive generation, pre-fetched context, speculative summaries — track what percentage of that output was actually used. If you can't instrument this directly, proxy it with the abandonment rate: how often does a user session end without consuming any output from a background call that fired during that session?
Fan-out ratio. Count the number of LLM calls per user-visible output. A ratio above 5 warrants investigation. A ratio above 15 almost certainly contains work that could be deferred or eliminated. Tag each call with a reason code and aggregate — you'll usually find that two or three call categories account for the majority of spend, and at least one of them is speculative.
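A minimal sketch of computing all three proxies from a tagged call log might look like the following; the record schema (session_id, category, cached_input_tokens, output_consumed) is an assumption about your logging, not a standard format.

```python
# Compute the three proxies from a per-call log. Field names are assumptions
# about your own logging schema.
from collections import Counter
from dataclasses import dataclass

@dataclass
class CallRecord:
    session_id: str
    category: str             # e.g. "user-requested", "proactive", "background-summary"
    input_tokens: int
    cached_input_tokens: int
    output_consumed: bool     # did any user-visible output depend on this call?

def proxies(calls: list[CallRecord], visible_outputs_per_session: dict[str, int]):
    total_in = sum(c.input_tokens for c in calls)
    cached = sum(c.cached_input_tokens for c in calls)
    speculative = [c for c in calls if c.category != "user-requested"]
    accepted = sum(1 for c in speculative if c.output_consumed)

    calls_per_session = Counter(c.session_id for c in calls)
    fan_out = [
        calls_per_session[s] / max(visible_outputs_per_session.get(s, 1), 1)
        for s in calls_per_session
    ]
    return {
        "cache_hit_rate": cached / total_in if total_in else 0.0,
        "speculative_acceptance": accepted / len(speculative) if speculative else None,
        "median_fan_out": sorted(fan_out)[len(fan_out) // 2] if fan_out else None,
    }
```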
Trigger Design That Doesn't Waste Budget
The goal isn't to eliminate speculative work. Pre-fetching and proactive generation have real value when they're right. The goal is to stop firing them unconditionally.
Probability-gated triggers. Before firing a background inference call, estimate the probability that its output will be used. If you can't estimate this, start with a conservative threshold (60–70%) and collect data. A background summarization job that fires when a session has 80% of context window used is rational. One that fires every 10 turns regardless of remaining capacity is not.
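A minimal gate, assuming you can estimate the usage probability from session signals and put a rough dollar value on the latency saved; both numbers below are placeholders to calibrate from your own logs.

```python
# A probability gate for background inference. The probability estimator and
# the cost/value figures are placeholders, not recommended values.

def should_fire(p_used: float, expected_cost_usd: float,
                value_if_used_usd: float, floor: float = 0.6) -> bool:
    """Fire a background call only if the estimated usage probability clears a
    conservative floor AND the expected value exceeds the expected cost."""
    return p_used >= floor and p_used * value_if_used_usd > expected_cost_usd

# Example: pre-generating a follow-up answer that costs ~$0.02 per attempt and
# saves latency worth ~$0.03 when it actually lands.
print(should_fire(p_used=0.45, expected_cost_usd=0.02, value_if_used_usd=0.03))  # False
print(should_fire(p_used=0.80, expected_cost_usd=0.02, value_if_used_usd=0.03))  # True
```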
Demand-shaped pre-fetching. Rather than pre-generating outputs for likely next states, pre-load the inputs. Fetch documents, assemble context, warm the cache — but don't generate tokens until you know what's being asked. This captures the latency benefit (the expensive I/O is done) without paying inference cost speculatively.
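A sketch of the pattern, with the retrieval and generation calls stubbed out as placeholders for your retrieval layer and model client:

```python
# Demand-shaped pre-fetching: do the cheap, reusable I/O ahead of time and
# defer token generation until the query is known. retrieve_chunks and
# generate_answer are stand-ins, not a real SDK's API.
import asyncio

async def retrieve_chunks(topics):            # placeholder for vector search / doc fetch
    await asyncio.sleep(0.05)                 # simulated I/O latency
    return [f"chunk about {t}" for t in topics]

async def generate_answer(query, context):    # placeholder for the actual LLM call
    return f"answer to {query!r} using {len(context)} chunks"

async def session_flow():
    # Fired when the user is likely to ask something: I/O only, no tokens spent.
    prefetched = await retrieve_chunks(["billing", "caching"])

    # Fired only when the user actually asks: inference cost is incurred here.
    query = "why is my cache hit rate low?"
    context = [c for c in prefetched if "caching" in c]
    print(await generate_answer(query, context))

asyncio.run(session_flow())
```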
Acceptance-rate-aware speculative decoding. Speculative decoding should be task-conditioned, not universal. For deterministic, structure-heavy tasks (JSON generation, code completion, templated outputs), high acceptance rates make it a clear win. For open-ended generation, disable it or use a very conservative draft budget. Both vLLM and TensorRT-LLM expose speculative decoding configuration, so routing different task types to differently configured serving deployments is feasible.
Lazy context compaction. When context is approaching window limits, prefer deletion over summarization. Verbatim deletion of older, less-relevant turns is computationally cheap and avoids the dual cost of generating a summary and then including that summary in every subsequent call. If summarization is genuinely needed for coherence, batch it: don't summarize after every N turns; summarize once when you have enough history to make a materially useful summary.
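A deletion-first compaction sketch; the whitespace token count and the 80% threshold below are crude stand-ins for a real tokenizer and a tuned limit.

```python
# Deletion-first compaction: trim old turns verbatim before resorting to
# summarization. No model call is made, so no tokens are billed.

def compact(messages: list[dict], window_limit: int, keep_recent: int = 6) -> list[dict]:
    def tokens(msgs):
        return sum(len(m["content"].split()) for m in msgs)  # crude proxy for a tokenizer

    if tokens(messages) < 0.8 * window_limit:
        return messages                       # nothing to do, nothing spent

    system, rest = messages[:1], messages[1:]  # assumes the first message is the system prompt
    # Drop the oldest non-recent turns until the history fits again.
    while len(rest) > keep_recent and tokens(system + rest) >= 0.8 * window_limit:
        rest = rest[1:]
    return system + rest
```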
The Billing Signal You're Probably Missing
Shadow compute hides because inference cost attribution is coarse. You see total tokens per model per day. You don't see total tokens per speculative call category per user outcome.
The signal that something is wrong almost always shows up first in cost velocity, not absolute cost. A 40%+ month-over-month increase in inference spend without a corresponding increase in user-visible feature usage is the clearest indicator that background work is proliferating. This is especially common after agentic features ship — the feature adds value, teams get positive feedback, and no one notices that the feature is also firing five background calls per session that weren't in the original design.
Set per-category spend budgets, not just aggregate budgets. Tag every LLM call with a category — proactive, background-summary, speculative-context, user-requested, retry — and alert when any non-user-requested category exceeds a threshold fraction of total spend (10% is a reasonable starting limit). This surfaces the waste explicitly without requiring deep changes to how the system works.
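A minimal tracker along those lines; the categories and the 10% threshold mirror the suggestion above, and the costs fed in are whatever your billing pipeline already computes per call.

```python
# Per-category spend tracking with a simple threshold check for any
# non-user-requested category.
from collections import defaultdict

class SpendTracker:
    def __init__(self, budget_fraction: float = 0.10):
        self.spend = defaultdict(float)
        self.budget_fraction = budget_fraction

    def record(self, category: str, cost_usd: float):
        self.spend[category] += cost_usd

    def over_budget(self):
        total = sum(self.spend.values()) or 1.0
        return [
            (cat, amount / total)
            for cat, amount in self.spend.items()
            if cat != "user-requested" and amount / total > self.budget_fraction
        ]

tracker = SpendTracker()
tracker.record("user-requested", 80.0)
tracker.record("proactive", 14.0)
tracker.record("background-summary", 6.0)
print(tracker.over_budget())   # [('proactive', 0.14)]
```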
The deeper structural fix is to treat inference budget as a resource allocation problem, not just a cost problem. Every background AI task is making a bet: it's spending N tokens now against a probability P of delivering value V. When N × price > P × V, the task is net-negative and should either not fire or fire less frequently. Running this math explicitly for each category of background work is what separates teams that control their inference costs from teams that are surprised by them.
What to Do This Week
If you haven't audited your fan-out ratio, do that first. Pull a sample of 1,000 sessions, count LLM calls per session, and categorize them. This usually takes half a day and almost always reveals at least one category of speculative work that no one knew was firing.
If your cache hit rate is below 70%, fix prompt prefix design before investing in anything else. Stable prefixes that don't change between requests are the single highest-leverage optimization in most systems — they cost nothing to implement and the reduction in compute is immediate.
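A sketch of prefix-stable prompt assembly; the strings are placeholders for your real prompt components, and the point is only the ordering.

```python
# Cache-friendly prompt assembly: everything identical across requests goes
# first, request-specific content goes last, so the provider's prefix cache
# can reuse the long static head.

SYSTEM_PROMPT = "You are a support assistant..."
TOOL_DEFINITIONS = "[tool schemas, identical every request]"
FEW_SHOT_EXAMPLES = "[worked examples, identical every request]"

STATIC_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, FEW_SHOT_EXAMPLES])

def build_prompt(retrieved_context: str, user_message: str) -> str:
    # Dynamic content only after the stable prefix; a timestamp or request ID
    # inserted before this point would invalidate the cached prefix.
    return "\n\n".join([STATIC_PREFIX, retrieved_context, user_message])
```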
If you have background summarization running on a timer, change it to run on demand (user session end, explicit save) or replace it with context window management through selective deletion. The summarization tax — paying to read history you're about to discard — is often the largest single source of shadow compute in conversational systems.
The shadow compute tax is a design choice, not a vendor problem. You pay it when background work fires unconditionally, when speculative outputs are generated without tracking their acceptance rate, and when fan-out is treated as an engineering detail rather than an economic one. The good news: it's entirely recoverable once you can see it.
Sources
- https://www.cloudzero.com/blog/inference-cost/
- https://leanlm.ai/blog/llm-cost-optimization
- https://blog.premai.io/speculative-decoding-2-3x-faster-llm-inference-2026/
- https://www.tensormesh.ai/blog-posts/agent-skills-caching-cacheblend-llm-cache-hit-rates
- https://arxiv.org/html/2509.23586v2
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- https://arxiv.org/pdf/2509.21361
- https://openreview.net/pdf?id=n4V3MSqK77
- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2604.25724
- https://eval.16x.engineer/blog/llm-context-management-guide
