Skip to main content

29 posts tagged with "finops"

View all tags

GPU Capacity Is a Roadmap Constraint: The 18-Month Contract That Decided Q3

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, fourteen months ago, a finance director and a platform lead signed a multi-year accelerator commitment. They built a peak-load model from the prior quarter's telemetry, negotiated a discount of 40 to 70 percent off on-demand pricing, and locked in the cluster shape that your product roadmap now has to fit inside. Nobody on the product team was in the room. Nobody on the application engineering team saw the spreadsheet. The contract is binding, the discount only applies if the commitment is honored, and the capacity envelope it bought is now the silent ceiling on every Q3 feature your PMs are scoping.

The gap most teams don't notice until the second year: capacity contracts are roadmap decisions, but they're being made by people who don't see the roadmap, using inputs that don't include the roadmap. The product trio thinks it's choosing features from a clean priority backlog. Finance thinks it's optimizing a fixed envelope. Both are right inside their own frame, and the collision shows up in a planning meeting where an architect proposes a 70B-parameter model for the new assistant feature and the platform lead says, quietly, that the cluster is already at 85 percent and that model doesn't fit without crowding out something else.

The Off-Hours Cost Curve: Why Your AI Feature Spends Differently on Saturday Than on Tuesday

· 10 min read
Tian Pan
Software Engineer

The cost dashboard everyone looks at is a weekly rolling average, and that average is lying to you. Not in the sense that the number is wrong — it's a faithful arithmetic mean of a billing event stream — but in the sense that it is hiding the shape of the cost curve underneath. The hours between Friday evening and Monday morning consume tokens differently from the hours between Tuesday at 10am and Thursday at 4pm. The cohort active on Saturday at 3am is not the cohort active on Tuesday at 11am, and the per-user economics of those cohorts diverge by a factor that nobody writes down because the dashboard averaged it away.

Most teams discover this the first time a weekend automation script melts the budget. A LangChain agent gets into an infinite conversation cycle Friday night, runs for the better part of a week before anyone notices, and produces a five-figure invoice that has to be explained to finance on Monday morning. The post-incident review treats it as a one-off — bad retry logic, missing budget cap, didn't page on-call. But the same dashboard that hid the runaway loop is also hiding the steady-state version of the same phenomenon: a baseline of off-hours traffic whose unit economics are structurally worse than the business-hours baseline, every single week, and which the weekly average smooths into invisibility.

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.

Token Accounting Drift: When Your Trace Logs Don't Match the Provider Invoice

· 9 min read
Tian Pan
Software Engineer

There is a finance meeting that happens at every company shipping a hosted LLM feature, usually around month four. The engineering team has been logging token counts from every request. The finance team has the provider's invoice. The numbers don't agree. Sometimes the gap is five percent. Sometimes it is thirty. The engineers say the invoice is wrong. The finance team says the logs are wrong. Both teams are technically correct, and neither owns the reconciliation.

The drift is not fraud. It is a structural measurement problem, and the structure has at least six independent failure modes that compound. A team that does not own those failure modes will spend the next quarter writing apology emails to FP&A about why the forecast slipped, when the real story is that nobody on the engineering side ever audited what "token" meant in their own logs.

The Inter-Team Token Budget War: When Your AI Platform Team Becomes a Treasury Department

· 10 min read
Tian Pan
Software Engineer

The team that built your internal LLM gateway scoped it for "rate limiting and audit." Eighteen months later, the same team is running a quarterly allocation meeting, mediating a quota dispute between two product groups, and discovering that the architecture they shipped to solve a capacity problem now functions as the company's internal AI treasury. Nobody chartered them for this role. Nobody removed it from their plate either.

This is the trajectory every AI platform team is on, and most of them get to the political economy phase before they have a policy, a sponsor, or even the telemetry to defend a decision. The technical work — request routing, key management, retries — is the easy half. The hard half is that finite provider quota plus three product teams with launch deadlines is a budget allocation system, and the team running the gateway is the one being asked to allocate.

Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More

· 8 min read
Tian Pan
Software Engineer

A finance-led mandate to "switch to the smaller model" is one of the most reliable ways to raise your LLM bill quarter-over-quarter. The dashboard the procurement team is watching — cost per call, average tokens per request — keeps trending down. Meanwhile the invoice keeps trending up. By the time someone reconciles the two, six months of prompt iteration has been spent compensating for a model that's worse at the task, and the team is in too deep to walk it back without admitting the original switch was a mistake.

The mistake isn't about pricing. It's about the unit. Per-token price is a misleading axis when reasoning depth, retry count, and prompt size all vary by model. The right metric is tokens-per-successful-completion, and on that axis the cheaper model often loses.

Why Token Forecasts Drift After Launch — and How to Catch the Spike Before Finance Does

· 10 min read
Tian Pan
Software Engineer

The pre-launch cost model is a beautiful spreadsheet. It assumes a synthetic traffic mix run through a representative prompt at a tested cache hit rate and a clean tool-call path. The post-launch reality is that none of those assumptions survive the moment the feature actually starts working. The intents your synthetic traffic didn't cover are precisely the ones that stick. The marketing surge from a campaign engineering didn't get the meeting invite for lands on the highest-cost branch in your routing tree. The heavy-user cohort that uses 40× the median doesn't show up until week three.

The industry-wide version of this problem is now well-documented: surveys put the share of enterprises missing their AI cost forecasts by more than 25% at around 80%, and report routine cost increases of 5–10× in the months immediately after a successful launch. The crucial detail in those numbers is the word successful. Failed AI features stay on budget. The drift is driven by the feature working, not by the team doing something wrong. That makes it a planning artifact problem, not an engineering problem — and the planning artifact most teams reach for, the monthly bill, is the worst possible detector.

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

Your AI Pricing Page Is a Leveraged Bet on Token Economics

· 9 min read
Tian Pan
Software Engineer

When the team published the AI tier at "$X per seat for unlimited AI," nobody on the pricing call thought of it as a derivative position. It looked like a SaaS pricing page — a number, a tier, a CTA. But every dollar of revenue from that page is now exposed to a token-cost curve set by a vendor whose roadmap does not care about your gross margin. You did not write a pricing page. You wrote a naked short on token volatility, and the strike is whatever your vendor charges next quarter.

The math arrives quickly. A handful of power users discover the workflow and start running it on the longest context they can fit. A competitor's UX change re-trains the median user to send queries that are 40% longer. The frontier model your feature is locked to gets a price-per-million bump because the older tier you were on is being deprecated. Any one of these is a margin event you cannot reverse from the pricing page in a single quarter — and they tend to arrive together.

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

· 11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer creates a corporate credit card OpenAI account, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

· 10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Cost-Per-Correctness, Not Cost-Per-Token: The Unit Metric Your Bill Won't Tell You

· 11 min read
Tian Pan
Software Engineer

A team I know cut their inference bill 40% last quarter by migrating their support-email triage flow from a frontier model to a mid-tier one. The CFO sent a thank-you note. Six months later, customer support headcount was up two FTEs and average resolution time had risen 35%. Nobody connected the dots, because the dots lived in different dashboards: the inference bill on the platform team's, the support load on the operations team's. The migration looked like a win on the only metric anyone was tracking. The metric was wrong.

This is the cost-per-token trap. Your invoice tells you what you spent on tokens. It cannot tell you what you spent per correct task, because the inference vendor has no idea what "correct" means in your domain. They sold you raw compute. You bought outcomes — or thought you did. The gap between those two units is where AI unit economics quietly comes apart, and the team that doesn't measure the right denominator is running half the equation and shipping the other half blind.