
6 posts tagged with "finops"


The Cancellation Tax: Your Inference Bill After the User Hits Stop

9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the request has already been scheduled, its prompt loaded into the KV cache, and the model is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.
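The gap is measurable if you join client telemetry against the provider's usage log. Below is a minimal sketch of that reconciliation, assuming you record both the tokens the browser actually rendered and the tokens the provider billed; the field names (`client_seen_tokens`, `billed_output_tokens`) are illustrative, not any particular provider's schema.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    """One streamed completion, joined from client telemetry and the
    provider's usage log. Field names are illustrative."""
    request_id: str
    client_seen_tokens: int      # tokens the browser actually rendered
    billed_output_tokens: int    # tokens in the provider's usage block
    price_per_output_token: float

def cancellation_tax(records: list[RequestRecord]) -> dict:
    """Dollars billed for output tokens the user never saw."""
    wasted_tokens = 0
    wasted_dollars = 0.0
    for r in records:
        unseen = max(0, r.billed_output_tokens - r.client_seen_tokens)
        wasted_tokens += unseen
        wasted_dollars += unseen * r.price_per_output_token
    return {"wasted_tokens": wasted_tokens, "wasted_dollars": wasted_dollars}

# Example: user hit Stop after 10 tokens; the provider billed 800.
records = [RequestRecord("req-1", 10, 800, 15 / 1_000_000)]
print(cancellation_tax(records))  # ~790 unseen tokens, roughly $0.0119
```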

Cost Per Feature, Not Cost Per Token: The Allocation Gap in AI Budgets

10 min read
Tian Pan
Software Engineer

Your finance team can tell you, to the dollar, what you spent on Anthropic and OpenAI last month. Your product team can tell you which features users touched the most. Nobody in the building can tell you whether Draft-Email is profitable, whether Summarize-Thread should stay in the free tier, or whether the new Rewrite-Tone feature is eating Draft-Email's lunch on a per-user basis. You have two dashboards that claim to track the same dollars, and neither answers the question that actually drives product decisions.

This is the allocation gap. You measure token spend per endpoint because that is what the provider API gives you. But /chat serves twelve features that happen to share a prompt template, and "per endpoint" collapses all twelve into one line item. Pricing tiers, feature gating, deprecation calls, and the "do we ship this?" conversation all float on gut feel until someone does the plumbing to route token costs back to the features that incurred them.

The plumbing is not glamorous. It is request-level tagging, trace-to-telemetry joins, and a disciplined refusal to ship an AI feature without its own cost label. Teams that treat this as infrastructure investment end up with per-feature margin reports segmented by user cohort. Teams that defer it to next quarter end up making pricing decisions from vibes for eighteen months and discovering, after the fact, that a single customer segment was responsible for half the inference bill at negative margins.
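As a sketch of what that plumbing looks like: route every LLM call through one choke point that records a feature label alongside token counts, then aggregate dollars per feature. The helper name (`record_llm_cost`) and the prices are hypothetical, not from the post.

```python
from collections import defaultdict

# Hypothetical per-token prices; substitute your provider's actual rates.
PRICE = {"input": 3 / 1_000_000, "output": 15 / 1_000_000}

# feature -> accumulated dollars, populated by the wrapper below
cost_by_feature: dict[str, float] = defaultdict(float)

def record_llm_cost(feature: str, input_tokens: int, output_tokens: int) -> None:
    """Call this from the one choke point every LLM request flows through.
    The discipline is the point: no feature ships without a cost label."""
    cost_by_feature[feature] += (
        input_tokens * PRICE["input"] + output_tokens * PRICE["output"]
    )

# Twelve features can share the /chat endpoint; they no longer share a line item.
record_llm_cost("draft-email", input_tokens=1_200, output_tokens=400)
record_llm_cost("summarize-thread", input_tokens=6_000, output_tokens=150)
record_llm_cost("rewrite-tone", input_tokens=800, output_tokens=500)

for feature, dollars in sorted(cost_by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{feature:20s} ${dollars:.5f}")
```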

The Model Bill Is 30% of Your Inference Cost

8 min read
Tian Pan
Software Engineer

A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.

This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system — RAG, agents, anything with tool use or evaluation gates — model inference is often only 30–50% of the real unit economics. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."
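A toy decomposition makes the arithmetic concrete. The figures below are assumptions chosen to show how the anecdote's numbers can reconcile (about a third of per-ticket cost in the model line, a 22% token cut, and slightly higher retry amplification on the cheaper model); they are not the company's actual data.

```python
# Illustrative per-resolved-ticket cost decomposition (assumed numbers).
# Only "model_tokens" shrinks when you swap Sonnet for Haiku; the other
# line items are model-agnostic, and retries can get worse.
before = {
    "model_tokens": 0.30,        # ~a third of per-request cost
    "retrieval": 0.20,
    "reranking": 0.10,
    "observability": 0.08,
    "retry_amplification": 0.12,
    "human_review_queue": 0.20,
}
after = dict(
    before,
    model_tokens=round(before["model_tokens"] * (1 - 0.22), 3),  # -22% token bill
    retry_amplification=0.146,   # assumed: the cheaper model fails more often
)

total_before, total_after = sum(before.values()), sum(after.values())
print(f"token bill: -22%, total per-ticket cost: -{1 - total_after / total_before:.1%}")
# -> token bill: -22%, total per-ticket cost: -4.0%
```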

Token Spend Is a Security Signal Your SOC Isn't Watching

10 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.
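A minimal sketch of the detector this argues for: a rolling-baseline alert on hourly token spend. The window size and z-score threshold are illustrative knobs, not hardened defaults.

```python
from statistics import mean, stdev

def spend_anomaly(hourly_spend: list[float],
                  window: int = 24,
                  z_threshold: float = 4.0) -> bool:
    """Flag the latest hour if it sits far above a rolling baseline."""
    if len(hourly_spend) <= window:
        return False                       # not enough history yet
    baseline = hourly_spend[-window - 1:-1]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return hourly_spend[-1] > mu * 2   # flat baseline: fall back to a ratio
    return (hourly_spend[-1] - mu) / sigma > z_threshold

# A stolen key looks like this: flat baseline, then a vertical line.
history = [12.0, 11.5, 13.1, 12.4] * 6 + [480.0]   # dollars per hour
print(spend_anomaly(history))  # True — page security, not just finance
```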

Why Agent Cost Forecasting Is Broken — And What to Do Instead

10 min read
Tian Pan
Software Engineer

Your finance team wants a number. How much will the AI agent system cost per month? You give them an estimate based on average token usage, multiply by projected request volume, and add a safety margin. Three months later, the actual bill is 3x the forecast, and nobody can explain why.

This isn't a budgeting failure. It's a modeling failure. Traditional cost forecasting assumes that per-request costs cluster around a predictable mean. Agentic systems violate that assumption at every level. The execution path is variable. The number of LLM calls per request is variable. The token count per call is variable. And the interaction between these variables creates a cost distribution with a fat tail that eats your margin.
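You can watch the fat tail appear in a few lines of simulation. The distributions below are invented for illustration: a geometric number of LLM calls per request, multiplied by lognormal tokens per call, priced at an assumed blended rate.

```python
import random

random.seed(0)
PRICE_PER_TOKEN = 15 / 1_000_000   # assumed blended price

def simulate_request() -> float:
    """Cost of one agent request under invented distributions:
    a geometric number of LLM calls times lognormal tokens per call."""
    calls = 1
    while random.random() < 0.75:              # ~25% chance the loop stops each cycle
        calls += 1
    cost = 0.0
    for _ in range(calls):
        tokens = random.lognormvariate(8.0, 1.0)   # median ~3k tokens per call
        cost += tokens * PRICE_PER_TOKEN
    return cost

costs = sorted(simulate_request() for _ in range(100_000))
mean_cost = sum(costs) / len(costs)
p50, p99 = costs[50_000], costs[99_000]
print(f"mean ${mean_cost:.3f}  p50 ${p50:.3f}  p99 ${p99:.3f}")
# The p99 lands many multiples above the mean: "average x volume"
# forecasts miss exactly the requests that dominate the bill.
```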

Token Economics for AI Agents: Cutting Costs Without Cutting Corners

10 min read
Tian Pan
Software Engineer

A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.

AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops — consuming roughly 5x more tokens than a direct chat interaction. A ReAct loop running 10 cycles can consume 50x the tokens of a single pass. At frontier model prices, that math becomes a liability fast.
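The multiplier comes from re-sending the growing transcript on every cycle, so input tokens grow roughly quadratically with cycle count. A back-of-envelope sketch, with assumed prompt and step sizes:

```python
# Back-of-envelope for the ReAct multiplier, with assumed token counts:
# each cycle re-sends the system prompt plus the whole transcript so far.
SYSTEM = 1_000        # system prompt + tool schemas (tokens)
PER_CYCLE = 700       # thought + tool call + tool result appended per cycle

def react_tokens(cycles: int) -> int:
    total = 0
    transcript = 0
    for _ in range(cycles):
        total += SYSTEM + transcript   # input: everything is sent again
        transcript += PER_CYCLE        # context grows every cycle
        total += 300                   # output tokens for this step (assumed)
    return total

single_pass = SYSTEM + PER_CYCLE + 300
print(react_tokens(10) / single_pass)   # ~22x here; larger tool outputs push it toward 50x
```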

This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.