Skip to main content

58 posts tagged with "cost-optimization"

View all tags

The Self-Critique Tax: When Asking the Model to Check Its Own Work Costs Double for Modest Wins

· 11 min read
Tian Pan
Software Engineer

A team ships a self-critique loop into production because the benchmark numbers looked irresistible: Self-Refine reported a 20 percent absolute improvement averaged across seven tasks, Chain-of-Verification cut hallucinations by 50 to 70 percent on QA workloads, and reflection prompts pushed math-equation accuracy up 34.7 percent in one widely-cited paper. A month later the finance review surfaces the bill. The product's per-request cost has roughly tripled, p99 latency is up by a factor of three, and the actual quality lift that survived contact with production traffic is closer to three percent than thirty. The self-critique loop is doing exactly what it advertised. The team just never priced it.

This is the self-critique tax: a reliability pattern that reads like a free quality win on a slide and reads like a structural cost increase on an invoice. The pattern itself is sound — there are real cases where generate-then-verify is the right answer. The failure mode is shipping it as a default instead of as a calibrated intervention, and discovering at the wrong time of the quarter that "the model checks its own work" was actually a procurement decision.

Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More

· 8 min read
Tian Pan
Software Engineer

A finance-led mandate to "switch to the smaller model" is one of the most reliable ways to raise your LLM bill quarter-over-quarter. The dashboard the procurement team is watching — cost per call, average tokens per request — keeps trending down. Meanwhile the invoice keeps trending up. By the time someone reconciles the two, six months of prompt iteration has been spent compensating for a model that's worse at the task, and the team is in too deep to walk it back without admitting the original switch was a mistake.

The mistake isn't about pricing. It's about the unit. Per-token price is a misleading axis when reasoning depth, retry count, and prompt size all vary by model. The right metric is tokens-per-successful-completion, and on that axis the cheaper model often loses.

Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot with 10,000 conversations and five turns each — that comes out to 200,000 LLM invocations, 400 million tokens, and roughly a million trace spans per day. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.

AI Feature Payback: The ROI Model Your Finance Team Won't Fight You On

· 10 min read
Tian Pan
Software Engineer

Every engineering team shipping AI features eventually hits the same wall: finance wants a spreadsheet that justifies the spend, and the spreadsheet you built doesn't actually work.

The problem isn't that AI features lack ROI. The problem is that AI economics break every assumption the standard ROI model was built on — fixed capital, linear cost curves, predictable timelines. Teams that treat AI spending like SaaS licensing get numbers that either look deceptively good before launch or collapse six months into production. The ten-fold gap between measured AI initiatives (55% ROI) and ad-hoc deployments (5.9% ROI) comes almost entirely from whether teams got the measurement model right before they shipped.

Your AI Product's Dark Energy: The Background Compute Nobody Budgeted

· 10 min read
Tian Pan
Software Engineer

When your AI feature ships, you build a latency budget: how long does the model call take, how long does retrieval take, what's the p99 for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.

Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.

This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.

Lazy Evaluation in AI Pipelines: Stop Calling the LLM Until You Have To

· 11 min read
Tian Pan
Software Engineer

Most AI pipelines are written as if every request deserves a full LLM call. The user submits a message, the pipeline passes it to the model, waits for a response, and returns it — every time, unconditionally. This works, but it's expensive, slow, and often unnecessary.

The fraction of requests that actually require a full LLM inference is smaller than most engineers assume. Research on token-level routing shows that only about 11% of tokens differ between a 1.5B and a 32B parameter model, and only 4.9% of tokens are genuinely "divergent" — meaning they alter the reasoning path if handled by the smaller model. Production semantic caches show that 65% of incoming traffic is semantically similar to something the pipeline has already answered. These aren't edge cases. They're the majority of your traffic, and you're paying full price to handle them.

The fix is lazy evaluation: don't invoke the expensive model until you've confirmed that the expensive model is actually needed.

Thinking Budgets: When Extended Reasoning Models Actually Make Economic Sense

· 10 min read
Tian Pan
Software Engineer

A surprising number of AI teams default to extended thinking on every query once they gain access to an o3-class or Claude extended thinking model. The logic seems obvious: smarter reasoning equals better outputs, so why not always enable it? The problem is that this reasoning fails to account for a basic fact of how test-time compute scaling works in practice. Extended thinking dramatically improves performance on a specific class of tasks, degrades quality on others, and can inflate your inference costs by 5–30x across the board. The teams getting the most value from these models treat the reasoning budget as an explicit decision — one with the same weight as model selection or prompt engineering.

This post lays out the task taxonomy, the cost structure, and the routing decision framework that distinguishes teams who use thinking budgets strategically from teams who are just paying a premium for an illusion of quality.

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

· 8 min read
Tian Pan
Software Engineer

Most teams optimize AI inference costs by routing cheaper queries to cheaper models. That sounds reasonable — and it's backwards. The queries that go to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.

The 20% Problem in Model Routing: When Cost Optimization Creates Second-Class Users

· 9 min read
Tian Pan
Software Engineer

Your routing system works exactly as designed. Eighty percent of queries go to the cheap model; twenty percent escalate to the capable one. Latency is down, costs dropped by 60%, and leadership is happy. Then someone pulls the data by user segment, and you see it: users writing in non-native English are escalated at half the rate of native speakers, and their satisfaction scores are 18 points lower. The routing system treated the query complexity signal as neutral, but it wasn't — it was a proxy for language proficiency, and you've been giving a systematically worse product to a specific group of users for months.

This is the 20% problem. It's not a bug in the router. It's an emergent property of any cost-optimized routing system that nobody measures until it's too late.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

· 9 min read
Tian Pan
Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.

When to Skip Real-Time LLM Inference: The Production Case for Async Batch Pipelines

· 10 min read
Tian Pan
Software Engineer

There's a team somewhere right now watching their LLM spend grow 10x month-over-month while their p99 latency hovers around four seconds. The engineers added more retries. The retries hit rate limits. The rate limits triggered fallbacks. The fallbacks are also LLM calls. Nobody paused to ask: does this feature actually need to respond in real time?

Most AI product teams architect for the happy path — user sends a message, model responds, user sees it. The synchronous call pattern is what the API SDK demonstrates in its first code sample, and so that's what ships. But a surprisingly large share of production LLM workloads have nothing to do with a user waiting at a keyboard. They're document enrichment jobs, content classification pipelines, embedding generation tasks, nightly digest generation, and background quality scoring. For those workloads, real-time inference is the wrong tool — and the price you pay for using it anyway is real money, cascading failures, and operational complexity you'll spend months untangling.

Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops

· 10 min read
Tian Pan
Software Engineer

Most "agents" in production today are background jobs wearing a chat interface. They do not need a user typing into them. They need a trigger, a state file, and a way to resume after the inevitable timeout. The conversational loop — request, tool call, request, tool call, indefinitely — is a demo affordance that quietly became the default execution model, and it is the wrong model for the majority of agentic work that ships.

The decision is not philosophical. It shows up on the bill, in the on-call pager, and in the percentage of runs that finish at all. A conversational loop holds a model session open across many turns, accumulates context, and dies if any link in the chain fails. A scheduled trigger fires at a deterministic boundary, runs to completion or to a checkpoint, and writes its state somewhere durable before exiting. One is a phone call. The other is a job queue. Treating the two as interchangeable is how a $200/month feature becomes a $40,000/month feature without anyone changing the prompt.