Skip to main content

64 posts tagged with "cost-optimization"

View all tags

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 9 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost 60permilliontokens.Byearly2026,youcouldbuyequivalentperformancefor60 per million tokens. By early 2026, you could buy equivalent performance for 0.06. That is a 1,000x reduction in three years. During the same period, enterprise AI spending grew 320% — from 11.5billionto11.5 billion to 37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.

The Inference-Time Personalization Trap: When User Context Costs More Than It Earns

· 9 min read
Tian Pan
Software Engineer

There's a pattern that shows up in nearly every AI product once it hits a few hundred thousand active users: the team adds personalization — injects user history, preference signals, behavioral data into every prompt — and watches the product get slightly better while the infrastructure bill gets significantly worse. When they finally pull the logs and measure the quality delta per added token, the curve is almost always the same shape: steep gains early, then a long plateau, then diminishing returns you're paying full price for.

Most teams don't run that analysis until they're already in the hole. This post is about why the trap exists, where personalization stops paying, and what the architectures that actually work look like in production.

AI Feature Billing Is an Engineering Problem Nobody Planned For

· 9 min read
Tian Pan
Software Engineer

Microsoft's Copilot launched with a clean story: 30/user/month,productivitymultiplied.Theactualmathwasuglier.Onceyoufactoredinthebaseenterpriselicense,computecostsperactiveuser,andsupportoverhead,Microsoftwaslosingover30/user/month, productivity multiplied. The actual math was uglier. Once you factored in the base enterprise license, compute costs per active user, and support overhead, Microsoft was losing over 20 per user per month on the feature. Finance didn't catch this immediately because the costs lived in the infrastructure budget, not the product P&L. Engineering knew the token bills were large. Nobody had connected the two lines.

This is the billing problem that most AI teams build into their products without realizing it. Not the pricing strategy problem — that's a product decision. The engineering problem: you have no infrastructure to measure what AI features actually cost per customer, per feature, and per request at the granularity required to make any pricing model work.

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

· 11 min read
Tian Pan
Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.

The Observability Tax: When Monitoring Your AI Costs More Than Running It

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI-powered customer support bot. It works. Users are happy. Then the monthly bill arrives, and you discover that the infrastructure watching your LLM calls costs more than the LLM calls themselves.

This isn't a hypothetical. Teams are reporting that adding AI workload monitoring to their existing Datadog or New Relic setup increases their observability bill by 40–200%. Meanwhile, inference costs keep dropping — GPT-4-class performance now runs at 0.40permilliontokens,downfrom0.40 per million tokens, down from 20 in late 2022. The monitoring stack hasn't gotten that memo.

The result is an inversion that would be funny if it weren't expensive: you're paying more to watch your AI think than to make it think.

Why Agent Cost Forecasting Is Broken — And What to Do Instead

· 10 min read
Tian Pan
Software Engineer

Your finance team wants a number. How much will the AI agent system cost per month? You give them an estimate based on average token usage, multiply by projected request volume, and add a safety margin. Three months later, the actual bill is 3x the forecast, and nobody can explain why.

This isn't a budgeting failure. It's a modeling failure. Traditional cost forecasting assumes that per-request costs cluster around a predictable mean. Agentic systems violate that assumption at every level. The execution path is variable. The number of LLM calls per request is variable. The token count per call is variable. And the interaction between these variables creates a cost distribution with a fat tail that eats your margin.

The Batch LLM Pipeline Blind Spot: Offline Processing and the Queue Design Nobody Talks About

· 11 min read
Tian Pan
Software Engineer

Most teams building with LLMs optimize for the wrong workload. They obsess over time-to-first-token, streaming latency, and response speed — then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time. The latency-first mental model that works for chat applications actively sabotages these offline workloads.

The batch LLM pipeline is the unglamorous workhorse of production AI. It's the nightly job that classifies 50,000 support tickets, the weekly pipeline that enriches your CRM with company descriptions, the daily run that generates embeddings for new documents. These workloads have fundamentally different design constraints than real-time serving, and treating them as slow versions of your chat API is where the problems start.