
Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

This is the central FinOps problem of agentic systems. The cost of an AI feature is a composition of seven or eight cost surfaces, each owned by a different team, each measured against a different KPI, each optimized in isolation. The team that ships the feature owns one or two of those surfaces. The team that owns the largest surface is often the last to know it is the largest.

The COGS Decomposition Nobody Draws

Pull a single resolved task — a customer support ticket closed by an agent, a contract clause flagged by an internal copilot, a code change proposed and merged by a coding agent — and try to attribute the actual dollar cost. The line items look something like this, and the proportions vary by product but the structure does not:

  • Foundation model inference. The bill on the dashboard. For agent workloads it includes the planner calls, the tool-arg generation, the structured-output retries, and the final-response synthesis. Industry estimates put this at 40–60% of total agent COGS, not 100%.
  • Retrieval-side embedding inference. Every document ingested needs an embedding. Every query, in many architectures, also needs a query-time embedding. The embedding provider's bill can match or exceed the vector database bill itself, depending on data churn.
  • Vector database queries and storage. At small scale, this is rounding error. At 100M+ vectors with sustained query throughput, it is a 3–5x multiplier over self-hosted alternatives, and the line item that "grew 4x faster than expected" in board decks.
  • Tool-call API costs. Every tool the agent invokes is either a paid third-party API, an internal service that runs on someone's infrastructure, or both. A single agent turn that hits search, calendar, CRM, and a payments API has four invoices behind it.
  • Structured-output retry compute. When tool args are malformed or schema-invalid, the loop retries. Each retry is a full inference call, often with the failure context appended, and a 2–3x retry multiplier on agentic workflows is the difference between a 40% gross margin and a 25% gross margin.
  • Telemetry and trace storage. Each span in an agent trace carries the system prompt, retrieved chunks, and full completions — tens of kilobytes per span versus the sub-kilobyte payloads of a typical REST trace. Teams routinely sample down to 0.1% of production traffic to keep observability bills tractable.
  • Eval-pipeline reruns. Every model bump, prompt change, and retrieval-config change reruns the eval suite. If the eval suite is large (and it should be), this is a recurring inference bill that does not appear in the user-facing inference dashboard because it is filed under "engineering compute."
  • Human-loop labor. The reviewer who approves edge cases, the annotator who labels production traces, the on-call engineer who triages bad outputs. This shows up as headcount, not COGS, but it scales with usage and belongs in the unit economics model.

The team that decomposes its agent COGS into these eight items can answer questions like "what is the marginal cost of a resolved task at our current volume" and "which line item dominates our cost-per-task at 10x scale." The team that does not is forecasting unit economics off a single number that captures less than half of what the feature actually costs.
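
To make the shape of that calculation concrete, here is a minimal sketch of the per-task decomposition as code. Every dollar figure below is an illustrative placeholder rather than a benchmark; the structure, not the numbers, is the point.

```python
from dataclasses import dataclass, fields

@dataclass
class CostPerResolvedTask:
    """Dollar cost attributed to one resolved task, split across the eight
    surfaces above. All figures are illustrative placeholders."""
    model_inference: float = 0.180      # planner, tool-arg, retry, synthesis calls
    embedding_inference: float = 0.020  # ingest + query-time embeddings, amortized
    vector_db: float = 0.035            # query compute + storage, amortized
    tool_apis: float = 0.040            # third-party and internal tool invocations
    retry_compute: float = 0.045        # schema-invalid tool-arg retries
    telemetry_storage: float = 0.015    # trace spans with full payloads
    eval_reruns: float = 0.025          # eval suite amortized over shipped tasks
    human_review: float = 0.060         # reviewer minutes at a loaded hourly rate

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

    def share(self, surface: str) -> float:
        return getattr(self, surface) / self.total()

task = CostPerResolvedTask()
print(f"cost per resolved task: ${task.total():.3f}")
print(f"inference share: {task.share('model_inference'):.0%}")  # ~43% here, not 100%
```

Swap in measured figures per feature and the marginal-cost and dominant-surface questions above fall out of the same few lines.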

Why The Inference Bill Lies First

The inference bill is the most visible cost surface and the least useful one for unit economics, for three structural reasons.

It is monthly and aggregated. The bill arrives as one number, sometimes split by model and sometimes not. Mapping that number back to specific features, customer cohorts, or task types requires telemetry that most teams have not invested in. By the time you can attribute a quarter of last month's spend to a specific feature, the spend has already happened.
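
Attribution is an instrumentation problem more than a billing problem. A minimal sketch of the tagging side, assuming an OpenTelemetry-style tracer; the attribute names and the injected call_model client are illustrative, not a standard:

```python
from dataclasses import dataclass
from opentelemetry import trace

tracer = trace.get_tracer("agent.costing")

@dataclass
class LLMResponse:
    """Stand-in for whatever your model client returns."""
    text: str
    prompt_tokens: int
    completion_tokens: int

def run_llm_call(call_model, feature: str, cohort: str, task_id: str, prompt: str) -> LLMResponse:
    """Wrap each inference call in a span tagged with the dimensions the monthly
    bill lacks, so spend can later be grouped by feature, cohort, and task type."""
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("app.feature", feature)            # e.g. "ticket_triage"
        span.set_attribute("app.customer_cohort", cohort)     # e.g. "enterprise"
        span.set_attribute("app.task_id", task_id)
        response = call_model(prompt)                          # your model client here
        span.set_attribute("llm.prompt_tokens", response.prompt_tokens)
        span.set_attribute("llm.completion_tokens", response.completion_tokens)
        return response
```

With token counts and per-token prices joined downstream, last month's bill becomes a query you can group by feature and cohort instead of a single aggregate.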

It is the line item with the most public benchmarks. Vendors publish per-token prices. Newsletters compare them. The team's intuition is calibrated for inference cost, and not for any of the other seven surfaces. When a CFO asks "can we reduce cost," the inference dashboard is what gets opened, and the optimization happens at the layer where the team feels confident measuring.

It is the cost surface where the leverage is smallest. Inference prices have dropped roughly 1000x in three years. The marginal saving from negotiating a 15% discount on inference is real, but it is dwarfed by the marginal saving from a retrieval architecture that hits the cache 60% of the time, or a structured-output schema that drops the retry rate from 22% to 3%, or a telemetry sampling policy that retains 5% of traces with full payloads instead of 100% with truncated ones. Inference optimization is the easiest place to look. It is not the place with the largest lever.
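
A back-of-envelope comparison makes the gap visible. The per-task figures below reuse the illustrative placeholders from the decomposition sketch, and the cache assumption (a hit avoids the full generation call) is deliberately simplistic:

```python
# Illustrative $0.420 cost per resolved task, of which $0.180 is model inference.
base_inference = 0.180
base_retry = 0.045
base_telemetry = 0.015
base_total = 0.420

levers = {
    "15% inference discount":   0.15 * base_inference,           # ~$0.027/task
    "60% cache hit rate":       0.60 * base_inference,           # ~$0.108/task
    "retry rate 22% -> 3%":     base_retry * (1 - 0.03 / 0.22),  # ~$0.039/task
    "5% full-payload sampling": base_telemetry * (1 - 0.05),     # ~$0.014/task
}

for name, saving in sorted(levers.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} saves ${saving:.3f}/task ({saving / base_total:.0%} of COGS)")
```

On these assumptions the negotiated discount is the third lever, not the first.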

A team that spends a quarter negotiating an inference discount while the vector database bill grows 4x is not optimizing — it is performing optimization theater on the wrong cost surface.

The Org Chart Decides Where The Lever Is

The most expensive structural feature of agent COGS is that no single team owns the integrated number. Inference cost lives on the platform team's budget. Vector database cost lives on the data infra team's budget. Tool-call costs are charged back to whoever owns the integrating service. Telemetry storage is owned by the observability team. Eval reruns are absorbed by engineering compute, often unattributed. Human review is in customer-success or operations headcount.

Each of these teams optimizes against its own KPI. The platform team gets a quarterly target for inference cost reduction. The data infra team is graded on vector database p99 latency. The observability team is asked to provide more retention, not less. None of them is wrong locally. All of them produce a globally suboptimal cost structure.

The discipline that has to land is an integrated COGS owner — a single role, often inside FinOps or product engineering, whose KPI is dollars per resolved task and whose authority spans all eight cost surfaces. Without this role, cost optimization fights happen at the wrong layer: the team most able to move the largest lever (retrieval architecture, eval reuse, telemetry sampling) does not have the budget pressure to move it, and the team with the budget pressure (platform, on the inference line) is optimizing the smallest lever.

This is why agentic SaaS companies trend toward 50–60% gross margins where traditional SaaS sits at 80–90%. It is not that AI is structurally less profitable. It is that the cost decomposition is unfamiliar, the org chart was built for the old composition, and the integrated owner has not been hired yet.

What A Useful COGS Dashboard Looks Like

A unit-economics dashboard worth building has one row per cost surface, two columns for current and projected at 10x volume, and a derived bottom line of dollars-per-resolved-task. The cost surfaces are the eight listed above. The "resolved task" is whatever your product treats as the unit of value: a closed support ticket, a generated report, a merged PR, a successful checkout assist.

A few properties of this dashboard matter:

  • It is per-feature, not per-product. Two features with the same model can have wildly different COGS structures because one is retrieval-heavy and the other is tool-heavy. Aggregating them hides the lever.
  • It is per-cohort, at least at the high-value-customer level. The customer who runs 200 turns per session has a different COGS profile than the customer who runs 4. Pricing depends on knowing the difference.
  • It is predictive at 10x. Today's COGS at today's volume is interesting; today's COGS at next year's projected volume is the number that survives a budgeting cycle. The biggest forecasting errors come from line items that are sublinear today and superlinear at scale — vector DB compute under sustained query load, telemetry storage with retention requirements, human-review labor when the easy automation tier saturates.
  • It is cost-per-resolved-task, not cost-per-call. Pricing models from Intercom's Fin (per resolution) and Zendesk (per resolved case) are converging on outcome-based units because that is the unit a customer actually cares about. The COGS dashboard should match the pricing dashboard, or unit economics never close.

The teams that have built this dashboard typically discover within a quarter that their largest optimization lever is somewhere they were not looking — most often retrieval cache hit rate, structured-output retry rate, or telemetry sampling — and that the inference bill they were renegotiating was the third or fourth lever, not the first.
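
A minimal skeleton of that dashboard, with per-surface scaling exponents standing in for the sublinear-today, superlinear-at-scale behavior described above. Both the dollar figures and the exponents are illustrative assumptions to be replaced by measured fits:

```python
# One row per cost surface: (cost per task today, volume scaling exponent).
# Total cost scales as volume**exponent, so per-task cost scales as
# volume**(exponent - 1). All values are illustrative assumptions.
SURFACES = {
    "model_inference":     (0.180, 1.00),  # roughly linear in volume
    "embedding_inference": (0.020, 1.00),
    "vector_db":           (0.035, 1.15),  # superlinear under sustained query load
    "tool_apis":           (0.040, 1.00),
    "retry_compute":       (0.045, 1.00),
    "telemetry_storage":   (0.015, 1.20),  # retention requirements compound
    "eval_reruns":         (0.025, 0.80),  # amortizes over more shipped tasks
    "human_review":        (0.060, 1.10),  # easy automation tier saturates
}

def per_task_at(cost_today: float, exponent: float, volume_multiple: float = 10.0) -> float:
    return cost_today * volume_multiple ** (exponent - 1)

rows = [(name, cost, per_task_at(cost, exp)) for name, (cost, exp) in SURFACES.items()]
for name, now, later in sorted(rows, key=lambda r: -r[2]):
    print(f"{name:20s} ${now:.3f} -> ${later:.3f} per task at 10x")
print(f"{'TOTAL':20s} ${sum(r[1] for r in rows):.3f} -> ${sum(r[2] for r in rows):.3f}")
```

The interesting output is not the total; it is which rows change rank between the two columns.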

The Architectural Frame

The realization that lands at the end of the COGS decomposition exercise is that an AI feature's cost structure is a system-design question, not a model-pricing question. The lever is not "which provider is cheapest per token." The lever is "what is the architecture that minimizes total composed cost per resolved task at the volume we expect to hit."

That architecture has knobs the team controls: cache before retrieval, retrieval before generation, structured outputs before validation, sampling before storage, automation tiers before human review. Each knob has a cost surface attached. The team that owns the integrated number turns the right knobs first; the team that does not turns the easy knob and is surprised when the number does not move.
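
One way to make "turns the right knobs first" operational is to attach each knob to the cost surface it moves and rank by the dollars behind it. The mapping below is an illustration against the surface names used in the earlier sketches, not a prescription:

```python
# Each architectural knob mapped to the cost surface it primarily moves.
KNOB_TO_SURFACE = {
    "cache before retrieval":               "model_inference",
    "retrieval before generation":          "vector_db",
    "structured outputs before validation": "retry_compute",
    "sampling before storage":              "telemetry_storage",
    "automation tiers before human review": "human_review",
}

def knobs_by_leverage(cost_per_task: dict[str, float]) -> list[tuple[str, float]]:
    """Rank knobs by the dollars-per-task sitting behind the surface they move."""
    return sorted(
        ((knob, cost_per_task.get(surface, 0.0)) for knob, surface in KNOB_TO_SURFACE.items()),
        key=lambda kv: -kv[1],
    )

print(knobs_by_leverage({"model_inference": 0.180, "vector_db": 0.035,
                         "retry_compute": 0.045, "telemetry_storage": 0.015,
                         "human_review": 0.060}))
# Cache first, then automation tiers, then the retry schema, on these numbers.
```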

The forecast that survives finance review at the end of the year is the one where every line item on the COGS dashboard is owned, instrumented, and modeled at projected volume. Anything less ships a unit-economics story that the finance team will quietly disprove next quarter — and the AI team will, accurately, complain that nobody told them which lever to pull.
