Your AI Product's Dark Energy: The Background Compute Nobody Budgeted
When your AI feature ships, you build a latency budget: how long the model call takes, how long retrieval takes, what the p99 is for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.
Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.
This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.
The Taxonomy of Invisible Inference
Before you can instrument background compute, you need to know what to look for. The categories are more numerous than most teams expect.
Preprocessing on ingest. When a user uploads a file, something has to turn it into a form the model can use. Document chunking is cheap; summarization isn't. If your product builds a per-document summary at upload time — for context injection, for semantic indexing, for a "what's in this file" affordance — that call happens outside any user-facing request, typically on a job queue. It's inference that your latency monitoring won't catch because it ran before the user's question arrived.
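To make the shape concrete, here is a minimal sketch of inference-on-ingest, assuming a Redis-backed RQ queue and the OpenAI Python client; the function names are illustrative, not from any particular codebase.

```python
# Sketch of inference-on-ingest: the upload handler returns fast, the model
# call fires later on a queue worker, outside any user-facing request trace.
from redis import Redis
from rq import Queue
from openai import OpenAI

queue = Queue("ingest", connection=Redis())
client = OpenAI()

def handle_upload(doc_id: str, text: str) -> None:
    # The user-facing request ends here; the inference happens later.
    queue.enqueue(summarize_document, doc_id, text)

def summarize_document(doc_id: str, text: str) -> str:
    # The invisible inference: one model call per uploaded document.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this document:\n\n{text}"}],
    )
    return response.choices[0].message.content
```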
Session boundary reprocessing. Products that maintain conversation history face a structural problem: conversations get long, long contexts are expensive, and the next session shouldn't inherit 80,000 tokens of history. The common solution is background summarization — condense the prior conversation into a compact representation before the user returns. This inference fires at session close or on a timer, not in response to any user action. One feature serving ten thousand daily active users with ten-turn conversations is generating ten thousand background model calls per day that your request traces will never show.
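A hedged sketch of what that timer-driven compaction might look like; `find_idle_sessions`, `load_turns`, and `save_summary` are hypothetical stand-ins for whatever storage layer you have.

```python
# Periodic sweep that condenses each idle conversation before its next
# session. Runs on a schedule, not in response to any user action.
from openai import OpenAI

client = OpenAI()

def compact_idle_sessions() -> None:
    for session_id in find_idle_sessions(idle_minutes=30):
        turns = load_turns(session_id)  # could be 80,000 tokens of history
        summary = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Condense this conversation, keeping decisions and open questions."},
                {"role": "user", "content": "\n".join(turns)},
            ],
        ).choices[0].message.content
        save_summary(session_id, summary)  # the next session starts from this, not the raw turns
```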
Proactive suggestion generation. This is the category most likely to have been added without deliberate cost analysis. Someone wanted "smart recommendations" or "suggested next steps" to appear instantly when the user opens a view. The implementation: generate them in advance, cache the result, serve from cache on open. Background inference on a schedule, triggered by data changes, user activity signals, or a cron job. The latency experience is great; the cost is invisible.
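The generate-ahead pattern, sketched under the assumption of a Redis cache and an external scheduler; `active_user_ids` and `build_activity_digest` are hypothetical helpers.

```python
# Generate suggestions in the background, serve them from cache on open.
# refresh_suggestions would be invoked by cron, Celery beat, or similar.
import json
from redis import Redis
from openai import OpenAI

cache = Redis()
client = OpenAI()

def refresh_suggestions() -> None:
    for user_id in active_user_ids():
        digest = build_activity_digest(user_id)
        suggestions = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Suggest three next steps given:\n{digest}"}],
        ).choices[0].message.content
        cache.set(f"suggestions:{user_id}", json.dumps(suggestions), ex=86400)

def get_suggestions(user_id: str):
    # The user-facing path: a cache read, zero visible inference.
    raw = cache.get(f"suggestions:{user_id}")
    return json.loads(raw) if raw else None
```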
Embedding regeneration on schema changes. Vector search works until you change what you're indexing. Update the chunking strategy, switch embedding models, add a new metadata field — every document in the corpus needs to be re-embedded. A million-document index getting re-embedded on a model migration is a substantial inference job. Teams typically plan for the time it takes; they rarely budget the cost, because the last time they did this was when the corpus was one-tenth the size.
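A back-of-envelope estimate makes the point; the corpus size, chunking numbers, and per-token rate below are illustrative, so substitute your own.

```python
# Rough cost of one full re-embed. Every number here is an assumption;
# plug in your corpus stats and your provider's actual embedding rate.
docs = 1_000_000
chunks_per_doc = 12                 # depends on chunk size and doc length
tokens_per_chunk = 500
price_per_million_tokens = 0.02     # example small-embedding-model rate, USD

total_tokens = docs * chunks_per_doc * tokens_per_chunk
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${cost:,.0f}")  # 6,000,000,000 tokens -> $120
```

And unlike the last migration, this one runs against today's corpus: the same script with ten times the documents is ten times the bill.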
Evaluation and judge calls. LLM-as-judge architectures call a model to score another model's output. If your product runs quality evaluations on sampled outputs, those judge calls are background compute. If you run improvement loops that generate candidates and score them, those are background compute. The user sees a response; the system burned three to five model calls deciding it was good enough to show.
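A sampled judge loop might look like the sketch below; the sample rate and scoring prompt are placeholders, not a recommended rubric.

```python
# Sampled LLM-as-judge: for some fraction of production outputs, a second
# model scores the first. Background inference; the user already got their answer.
import random
from openai import OpenAI

client = OpenAI()
SAMPLE_RATE = 0.1  # judge 10% of responses; tune to your eval budget

def maybe_judge(prompt: str, answer: str) -> float | None:
    if random.random() > SAMPLE_RATE:
        return None
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Score 1-5 how well the answer addresses the prompt.\n"
                       f"Prompt: {prompt}\nAnswer: {answer}\nReply with the number only.",
        }],
    ).choices[0].message.content
    return float(verdict)
```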
Why It Never Appears on the Dashboard
The monitoring gap has a structural cause: standard observability is request-centric. A trace starts when a request arrives and ends when a response goes out. Background jobs don't fit that model. They run on queues, on schedules, on event triggers — nothing links them to a user request ID. When you look at your trace explorer, you see your user-facing inference. The background inference simply isn't there.
Cost attribution has the same shape. If you're tracking cost per API call, you're tracking calls that originated from user requests. The job queue worker calling the embedding API doesn't carry the same request context your web server does. The cost accumulates in the same line item on your invoice, but the per-request attribution logic never captures it.
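One way to close the gap is explicit context propagation, sketched here with OpenTelemetry's propagation API; `enqueue_job` and `run_model_call` are stand-ins for your queue and inference client.

```python
# Serialize the active trace context into the job payload at enqueue time
# and restore it in the worker, so background inference appears as a child
# span of the user request that caused it.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("background-inference")

def enqueue_with_context(payload: dict) -> None:
    carrier: dict = {}
    inject(carrier)  # capture the current trace context
    enqueue_job({**payload, "otel": carrier})

def worker(job: dict) -> None:
    ctx = extract(job["otel"])  # rebuild the originating request's context
    with tracer.start_as_current_span("doc.summarize", context=ctx) as span:
        span.set_attribute("inference.trigger", "background")
        run_model_call(job)  # now attributable to the user request
```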
The result is a systematic undercount. Real systems can generate 15 to 40 model calls for every visible user action once you account for fan-out, retries, judge calls, improvement loops, fallback chains, and unbounded context growth. Teams looking at per-request token counts are seeing a fraction of the actual spend profile. The rest is background, unattributed, and often not even recognized as compute that needs a budget.
This explains a pattern that has surprised many teams: the cost of a feature in production scales much faster than the number of users. Users aren't the only thing driving inference. Background jobs scale with data volume, with the number of documents in the corpus, with the number of sessions that need reprocessing, with the number of items that need proactive recommendations. Those things can grow faster than the user count, and they're invisible to per-user cost attribution.
Making Background Compute a First-Class Cost Category
The fix starts with naming. The most useful distinction in AI cost observability is between inference triggered by a user action and inference not triggered by a user action. Once you name those two categories, you can measure them separately and ask different questions about each.
For user-triggered inference, the relevant question is: what is the cost per user action, and is it within the budget we designed for? For background inference, the question is: what is this job doing, does it deliver user value, and does that value justify its compute cost?
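A minimal version of that split, as a sketch: force every model call through a wrapper that declares its trigger and job name, so cost rolls up per category. The price table and the `record_cost` sink are illustrative.

```python
# Tagging layer for the two-category taxonomy: every call declares whether
# a user action triggered it, and which job it belongs to.
from openai import OpenAI

client = OpenAI()
PRICE = {"gpt-4o-mini": (0.15, 0.60)}  # USD per 1M input/output tokens, example rates

def tracked_completion(trigger: str, job: str, **kwargs):
    assert trigger in ("user", "background")
    response = client.chat.completions.create(**kwargs)
    in_rate, out_rate = PRICE[kwargs["model"]]
    cost = (response.usage.prompt_tokens * in_rate
            + response.usage.completion_tokens * out_rate) / 1_000_000
    record_cost(trigger=trigger, job=job, usd=cost)  # emit to your metrics backend
    return response

# Usage: the same call site, now attributable.
# tracked_completion("background", "session_summarize",
#                    model="gpt-4o-mini", messages=[...])
```

Once every call carries those two tags, the dashboard question changes from "what did inference cost?" to "what did each background job cost, and is it worth it?"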
