Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

This is the central FinOps problem of agentic systems. The cost of an AI feature is a composition of seven or eight cost surfaces, each owned by a different team, each measured against a different KPI, each optimized in isolation. The team that ships the feature owns one or two of those surfaces. The team that owns the largest surface is often the last to know it is the largest.

The COGS Decomposition Nobody Draws

Pull a single resolved task (a customer support ticket closed by an agent, a contract clause flagged by an internal copilot, a code change proposed and merged by a coding agent) and try to attribute the actual dollar cost. The line items look something like this. The proportions vary by product, but the structure does not:

  • Foundation model inference. The bill on the dashboard. For agent workloads it includes the planner calls, the tool-arg generation, the structured-output retries, and the final-response synthesis. Industry estimates put this at 40–60% of total agent COGS, not 100%.
  • Retrieval-side embedding inference. Every document ingested needs an embedding. Every query, in many architectures, also needs a query-time embedding. The embedding provider's bill can match or exceed the vector database bill itself, depending on data churn.
  • Vector database queries and storage. At small scale, this is a rounding error. At 100M+ vectors with sustained query throughput, a managed service can cost 3–5x what a self-hosted alternative would, and it becomes the line item that "grew 4x faster than expected" in board decks.
  • Tool-call API costs. Every tool the agent invokes is either a paid third-party API, an internal service that runs on someone's infrastructure, or both. A single agent turn that hits search, calendar, CRM, and a payments API has four invoices behind it.
  • Structured-output retry compute. When tool args are malformed or schema-invalid, the loop retries. Each retry is a full inference call, often with the failure context appended, so it costs more than the call it replaces. A 2–3x retry multiplier on agentic workflows is the difference between a 40% gross margin and a 25% gross margin.
  • Telemetry and trace storage. Each span in an agent trace carries the system prompt, retrieved chunks, and full completions — tens of kilobytes per span versus the sub-kilobyte payloads of a typical REST trace. Teams routinely sample down to 0.1% of production traffic to keep observability bills tractable.
  • Eval-pipeline reruns. Every model bump, prompt change, and retrieval-config change reruns the eval suite. If the eval suite is large (and it should be), this is a recurring inference bill that does not appear in the user-facing inference dashboard because it is filed under "engineering compute."
  • Human-loop labor. The reviewer who approves edge cases, the annotator who labels production traces, the on-call engineer who triages bad outputs. This shows up as headcount, not COGS, but it scales with usage and belongs in the unit economics model.

The team that decomposes its agent COGS into these eight items can answer questions like "what is the marginal cost of a resolved task at our current volume" and "which line item dominates our cost-per-task at 10x scale." The team that does not is forecasting unit economics off a single number that captures less than half of what the feature actually costs.
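As a rough illustration of what that decomposition can look like in practice, here is a minimal sketch of a per-task cost model with one field per surface from the list above. The class name, field names, and every dollar figure are hypothetical placeholders rather than measurements, and the sketch assumes retry compute is tracked separately from first-attempt inference.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskCost:
    """Dollar cost per resolved task, one field per cost surface (hypothetical names)."""
    model_inference: float
    embedding_inference: float
    vector_db: float
    tool_calls: float
    retry_compute: float
    telemetry_storage: float
    eval_reruns: float
    human_review: float

    def total(self) -> float:
        # Sum every cost surface into a single cost-per-task figure.
        return sum(getattr(self, f.name) for f in fields(self))

    def shares(self) -> dict[str, float]:
        # Fraction of the total contributed by each surface.
        total = self.total()
        return {f.name: getattr(self, f.name) / total for f in fields(self)}

    def dominant(self) -> str:
        # Name of the surface with the largest share.
        shares = self.shares()
        return max(shares, key=shares.get)


# Placeholder numbers for illustration only; replace with attributed costs.
example = TaskCost(
    model_inference=0.050,
    embedding_inference=0.008,
    vector_db=0.012,
    tool_calls=0.010,
    retry_compute=0.007,
    telemetry_storage=0.005,
    eval_reruns=0.003,
    human_review=0.005,
)

if __name__ == "__main__":
    print(f"cost per task: ${example.total():.3f}")
    for name, share in sorted(example.shares().items(), key=lambda kv: -kv[1]):
        print(f"{name:>20}: {share:.0%}")
```

Re-running the same model with projected 10x-volume inputs for each field is one way to see which surface dominates at scale, since the fields do not all grow at the same rate.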

Why The Inference Bill Lies First

The inference bill is the most visible cost surface and the least useful one for unit economics, for three structural reasons.

It is monthly and aggregated. The bill arrives as one number, sometimes split by model and sometimes not. Mapping that number back to specific features, customer cohorts, or task types requires telemetry that most teams have not invested in. By the time you can attribute a quarter of last month's spend to a specific feature, the spend has already happened.
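The attribution telemetry in question does not have to be elaborate. A minimal sketch, assuming each model call is logged with a feature tag and its token counts: the record shape, the feature names, and the per-million-token prices below are all hypothetical and should be replaced with whatever the team's own logging and vendor rates provide.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens; substitute real vendor rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


@dataclass(frozen=True)
class UsageRecord:
    """One logged model call, tagged with the feature that issued it."""
    feature: str
    input_tokens: int
    output_tokens: int


def record_cost(record: UsageRecord) -> float:
    # Price input and output tokens separately, then convert to dollars.
    return (
        record.input_tokens * PRICE_PER_MTOK["input"]
        + record.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def spend_by_feature(records: list[UsageRecord]) -> dict[str, float]:
    # Roll individual call costs up to the feature that triggered them.
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record_cost(record)
    return dict(totals)


# Illustrative records only.
records = [
    UsageRecord("support_agent", 1_000_000, 100_000),
    UsageRecord("contract_copilot", 200_000, 20_000),
    UsageRecord("support_agent", 500_000, 50_000),
]

if __name__ == "__main__":
    for feature, dollars in spend_by_feature(records).items():
        print(f"{feature}: ${dollars:.2f}")
```

Tagging at call time is the design choice that matters here: once the tag is on the record, the same rollup works per customer cohort or per task type by swapping the grouping key.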

It is the line item with the most public benchmarks. Vendors publish per-token prices. Newsletters compare them. The team's intuition is calibrated for inference cost, and not for any of the other seven surfaces. When a CFO asks "can we reduce cost," the inference dashboard is what gets opened, and the optimization happens at the layer where the team feels confident measuring.

It is the cost surface where the leverage is smallest. Inference prices have dropped roughly 1000x in three years. The marginal saving from negotiating a 15% discount on inference is real, but it is dwarfed by the marginal saving from a retrieval architecture that hits the cache 60% of the time, or a structured-output schema that drops the retry rate from 22% to 3%, or a telemetry sampling policy that retains 5% of traces with full payloads instead of 100% with truncated ones. Inference optimization is the easiest place to look. It is not the place with the largest lever.
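To make the comparison concrete, here is a back-of-the-envelope sketch that sets the 15% discount against the retry-rate drop from 22% to 3%. It rests on simplifying assumptions that are not from the source: each attempt fails independently at the stated rate, the loop retries until it succeeds, and a retry costs the same as the first attempt. Since retries often carry appended failure context, the last assumption probably understates the saving.

```python
def expected_attempts(retry_rate: float) -> float:
    """Expected inference calls per step when each attempt fails with
    probability `retry_rate` and the loop retries until success
    (mean of a geometric distribution)."""
    if not 0.0 <= retry_rate < 1.0:
        raise ValueError("retry_rate must be in [0, 1)")
    return 1.0 / (1.0 - retry_rate)


def saving_from_retry_fix(rate_before: float, rate_after: float) -> float:
    """Fraction of inference spend removed by lowering the retry rate,
    assuming every attempt costs the same."""
    return 1.0 - expected_attempts(rate_after) / expected_attempts(rate_before)


# Figures taken from the paragraph above.
discount_saving = 0.15
retry_saving = saving_from_retry_fix(0.22, 0.03)

if __name__ == "__main__":
    print(f"negotiated discount saves: {discount_saving:.1%} of inference spend")
    print(f"retry-rate fix saves:      {retry_saving:.1%} of inference spend")
```

Under these assumptions the schema fix alone removes a larger share of inference spend than the negotiated discount, and unlike the discount it also cuts the trace storage and tool-call costs that each retry generates.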
