
55 posts tagged with "cost-optimization"


The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy drops at around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.

Model Routing in Production: When the Router Costs More Than It Saves

· 10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.
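A back-of-envelope model makes the erosion concrete. All prices, volumes, and rates below are hypothetical, chosen to mirror the scenario above; the sketch also omits that retried requests often get escalated themselves, which is what pushed the real bill past the no-router baseline.

```python
# Back-of-envelope router cost model. Prices are per-request and hypothetical.
def blended_cost(requests, escalation_rate, retry_rate,
                 cheap_price, frontier_price):
    """Expected spend when a router splits traffic between two models.

    Lower-quality local answers trigger user retries, which re-enter
    the system as additional cheap-model volume.
    """
    local = requests * (1 - escalation_rate)
    escalated = requests * escalation_rate
    retried = local * retry_rate  # retry volume induced by worse local answers
    return (local + retried) * cheap_price + escalated * frontier_price

baseline = 100_000 * 0.030                                   # all-frontier
planned  = blended_cost(100_000, 0.30, 0.00, 0.002, 0.030)   # the pitch
actual   = blended_cost(100_000, 0.60, 0.25, 0.002, 0.030)   # what shipped

print(f"baseline ${baseline:,.0f}  planned ${planned:,.0f}  actual ${actual:,.0f}")
```

With these numbers the promised ~65% savings collapses to a fraction of that before retry escalation is even counted: the escalation rate alone dominates the outcome, and it is exactly the number the "routing working correctly" telemetry never surfaced.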

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.
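The break-even point falls out of that pricing shape directly. Assuming the advertised numbers above (reads at 10% of base input price, writes at a 1.25x or 2x premium), the hit rate at which caching stops losing money is:

```python
# Break-even hit rate for provider-side prefix caching, given the
# pricing shape described above: cache reads at 10% of base input
# price, cache writes at a premium over base.
def breakeven_hit_rate(write_premium, read_discount=0.10):
    """Hit rate h at which caching costs exactly as much as not caching.

    Per request on the cached prefix: a miss pays write_premium * base,
    a hit pays read_discount * base. Solve h*read + (1-h)*write = 1.
    """
    return (write_premium - 1) / (write_premium - read_discount)

print(f"5-min window:  {breakeven_hit_rate(1.25):.1%}")
print(f"1-hour window: {breakeven_hit_rate(2.00):.1%}")
```

Roughly 22% for the 5-minute window and 53% for the 1-hour window: below those hit rates, the write premiums outrun the read discounts and the cache is a net loss.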

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
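One way to right-size the knob is to derive it from observed completion lengths instead of a guessed round number. A minimal sketch, where the quantile and the 1.2x headroom factor are assumptions you would tune against your own truncation tolerance:

```python
import math

# Sketch: set max_tokens from what completions actually use in
# production, plus headroom, instead of a default like 4096.
def suggest_max_tokens(observed_lengths, quantile=0.99, headroom=1.2):
    """Cap at a high quantile of observed output lengths, with slack."""
    xs = sorted(observed_lengths)
    idx = min(len(xs) - 1, math.ceil(quantile * len(xs)) - 1)
    return math.ceil(xs[idx] * headroom)

# e.g. a feature whose completions never exceeded ~600 tokens
lengths = [180, 220, 240, 310, 350, 420, 480, 510, 560, 600]
print(suggest_max_tokens(lengths))  # 720, not 4096
```

The cap does not change what a well-behaved generation costs, but it bounds the damage from runaway outputs and keeps worst-case latency predictable, which is the slack you were silently paying for.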

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
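In its simplest form, that dispatch layer is just a function in the request path keyed on cheap, observable signals. The signals and model names below are illustrative, not a recommendation:

```python
# Minimal sketch of routing as a runtime dispatch decision.
def route(query: str) -> str:
    """Pick a model per request from cheap, observable signals."""
    code_markers = ("traceback", "stack trace", "exception", "def ", "{")
    if any(m in query.lower() for m in code_markers):
        return "frontier-model"      # debugging needs reasoning depth
    if len(query) > 4000:
        return "long-context-model"  # size, not difficulty, drives this one
    return "small-model"             # lookups, reformatting, summaries

assert route("Summarize this document: quarterly report") == "small-model"
assert route("Why does this Traceback appear on startup?") == "frontier-model"
```

Production routers replace the keyword heuristics with a trained classifier or an embedding lookup, but the architectural point stands either way: the decision happens per request, at runtime, not once in a config file.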

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.
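The arithmetic can be checked in a few lines, under the two retry policies in play: retrying only the failed step versus replaying the whole trajectory, which is effectively what happens when each retry replays full prompt history. The 3-step, 20% figures are the scenario from above:

```python
# Expected token-cost multiplier under two retry policies.
def per_step_retry_multiplier(p):
    """Retry only the failed step: expected attempts per step = 1/(1-p)."""
    return 1 / (1 - p)

def full_replay_multiplier(p, steps):
    """Restart the whole trajectory on any step failure: expected
    full runs = 1 / P(all steps succeed) = 1 / (1-p)^steps."""
    return 1 / ((1 - p) ** steps)

p = 0.20
print(f"retry-the-step: {per_step_retry_multiplier(p):.2f}x")
print(f"replay-it-all:  {full_replay_multiplier(p, 3):.2f}x")
```

The first policy gives the 1.25x an engineer intuits from "20% retry rate"; the second gives 1/(0.8)^3, about 1.95x, which is where "closer to 2x than 1.2x" comes from.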

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 9 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost $60 per million tokens. By early 2026, you could buy equivalent performance for $0.06. That is a 1,000x reduction in three years. During the same period, enterprise AI spending grew 320% — from $11.5 billion to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.

The Inference-Time Personalization Trap: When User Context Costs More Than It Earns

· 9 min read
Tian Pan
Software Engineer

There's a pattern that shows up in nearly every AI product once it hits a few hundred thousand active users: the team adds personalization — injects user history, preference signals, behavioral data into every prompt — and watches the product get slightly better while the infrastructure bill gets significantly worse. When they finally pull the logs and measure the quality delta per added token, the curve is almost always the same shape: steep gains early, then a long plateau, then diminishing returns you're paying full price for.

Most teams don't run that analysis until they're already in the hole. This post is about why the trap exists, where personalization stops paying, and what the architectures that actually work look like in production.

AI Feature Billing Is an Engineering Problem Nobody Planned For

· 9 min read
Tian Pan
Software Engineer

Microsoft's Copilot launched with a clean story: $30/user/month, productivity multiplied. The actual math was uglier. Once you factored in the base enterprise license, compute costs per active user, and support overhead, Microsoft was losing over $20 per user per month on the feature. Finance didn't catch this immediately because the costs lived in the infrastructure budget, not the product P&L. Engineering knew the token bills were large. Nobody had connected the two lines.

This is the billing problem that most AI teams build into their products without realizing it. Not the pricing strategy problem — that's a product decision. The engineering problem: you have no infrastructure to measure what AI features actually cost per customer, per feature, and per request at the granularity required to make any pricing model work.
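The core of that infrastructure is unglamorous: tag every call with who and what at the moment it happens, rather than reconstructing costs from the provider invoice later. A minimal sketch, where the prices and record shape are assumptions:

```python
from collections import defaultdict

# Sketch of per-customer, per-feature cost attribution.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # hypothetical rates

ledger = defaultdict(float)  # (customer, feature) -> dollars

def meter(customer, feature, input_tokens, output_tokens):
    """Record the cost of one call against its customer and feature."""
    cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]
    ledger[(customer, feature)] += cost
    return cost

meter("acme", "summarize", 2_000, 500)
meter("acme", "summarize", 1_000, 300)
meter("globex", "autocomplete", 400, 60)
print(dict(ledger))
```

A real system would emit these records to a metrics pipeline instead of an in-process dict, but the granularity is the point: without per-(customer, feature) attribution, no pricing model downstream has anything to stand on.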

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

· 11 min read
Tian Pan
Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.