46 posts tagged with "cost-optimization"

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

· 9 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.
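To make the distinction concrete, here is a minimal sketch of the kind of gate the post argues for: classify the request first, and only pay for extended reasoning where your own evals have shown a lift. The task labels, token budgets, and request shape below are illustrative assumptions, not recommendations from the post.

```python
# Illustrative sketch: enable extended thinking only for task classes that earn it.
# The task labels and budgets are placeholders; derive yours from eval results.

THINKING_BUDGET = {
    "ticket_classification": 0,     # no measurable lift from CoT in (hypothetical) evals
    "meeting_summary": 0,
    "sql_generation": 2_000,        # moderate lift on multi-join queries
    "root_cause_analysis": 8_000,   # large lift on multi-step debugging
}

def build_request(task_type: str, prompt: str) -> dict:
    """Return request params with extended thinking enabled only where it pays."""
    budget = THINKING_BUDGET.get(task_type, 0)
    params = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 1_024}
    if budget > 0:
        # The exact shape of the 'thinking' field varies by provider; treat as pseudocode.
        params["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return params
```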

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
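You can see the gap with ten lines of arithmetic. The numbers below (a 2,000-token system prompt plus tool schemas, 1,500 tokens of tool output per call, ten calls) are made up for illustration; substitute your own.

```python
# Naive per-call estimate vs. actual cumulative input tokens in an agent loop.
# All numbers are illustrative.

SYSTEM_TOKENS = 2_000       # system prompt + tool schemas, resent on every call
TOOL_RESULT_TOKENS = 1_500  # average tokens each tool result adds to the history
CALLS = 10

naive = CALLS * SYSTEM_TOKENS  # the "each call costs about one prompt" mental model

actual = 0
history = SYSTEM_TOKENS
for _ in range(CALLS):
    actual += history              # every call re-pays the whole accumulated context
    history += TOOL_RESULT_TOKENS  # and the next call's context is bigger

print(f"naive estimate: {naive:,} input tokens")
print(f"actual        : {actual:,} input tokens ({actual / naive:.1f}x)")
```

With these toy numbers the loop pays for 87,500 input tokens against a naive estimate of 20,000, and the gap widens as agents get longer or tool outputs get fatter.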

Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths

· 10 min read
Tian Pan
Software Engineer

A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM — using only 80% of the training examples, a 700x size reduction, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.

This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.
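The core loop is small enough to sketch. Below is its shape: have the frontier model label unlabeled examples, write out a synthetic training set, and train the small model on it. The function names and file format are placeholders, not the post's actual pipeline.

```python
# Sketch of teacher -> student distillation without fine-tuning the teacher.
# `call_teacher` is a placeholder for your provider's API.
import json

def call_teacher(text: str) -> str:
    """Ask the frontier model for a label (or label plus rationale). Placeholder."""
    raise NotImplementedError

def distill(unlabeled_examples: list[str], out_path: str = "distill_train.jsonl") -> None:
    with open(out_path, "w") as f:
        for text in unlabeled_examples:
            label = call_teacher(text)  # expensive call, paid once at training time
            f.write(json.dumps({"input": text, "target": label}) + "\n")
    # The resulting file trains a small model (e.g. a T5 variant) with any standard
    # supervised fine-tuning stack; the teacher itself is never updated.
```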

The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy drops at around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.

Model Routing in Production: When the Router Costs More Than It Saves

· 10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.
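The break-even check worth running before shipping a router fits in a few lines. The prices, escalation rates, and volume inflation below are illustrative stand-ins, not the team's actual numbers.

```python
# Blended cost per original request once routing and retry-driven volume are in play.
# All prices and rates are illustrative.

FRONTIER = 0.010   # $ per request on the frontier model
CHEAP    = 0.001   # $ per request on the cheap model

def blended_cost(escalation_rate: float, volume_inflation: float) -> float:
    """Cost per original request when cheap-path quality drives extra retries."""
    per_request = escalation_rate * FRONTIER + (1 - escalation_rate) * CHEAP
    return volume_inflation * per_request

baseline = FRONTIER  # everything on the frontier model, no router

for esc in (0.30, 0.60):
    for infl in (1.0, 1.2, 1.5, 1.8):
        c = blended_cost(esc, infl)
        print(f"escalation {esc:.0%}, volume x{infl:.1f}: "
              f"${c:.4f}/req ({c / baseline - 1:+.0%} vs no router)")
```

The table this prints shows how the savings erode: miscalibrated escalation shrinks the margin, and retry-driven volume growth is what finally flips the sign.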

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.
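The break-even condition is one line of algebra. Assuming the advertised numbers (cache reads at 10% of base input price, writes at 1.25x or 2x) and assuming every miss pays the write premium, the hit rate has to clear a threshold before caching saves anything:

```python
# Break-even hit rate for provider-side prefix caching.
# Assumes cache reads cost 0.10x base input price and every miss pays the write premium.
# Expected prefix cost per request, relative to not caching (1.0x):
#     cost(h) = (1 - h) * write_premium + h * 0.10
# Break-even where cost(h) = 1.0  =>  h = (write_premium - 1) / (write_premium - 0.10)

READ_PRICE = 0.10  # the advertised 90% discount on cache hits

def breakeven_hit_rate(write_premium: float) -> float:
    return (write_premium - 1.0) / (write_premium - READ_PRICE)

print(f"5-minute window (1.25x writes): {breakeven_hit_rate(1.25):.0%} hit rate to break even")
print(f"1-hour window   (2.00x writes): {breakeven_hit_rate(2.00):.0%} hit rate to break even")
```

Under these assumptions that works out to roughly a 22% hit rate for the 5-minute window and about 53% for the 1-hour window; below those thresholds the write premiums outrun the read discounts.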

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
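One hedged starting point is to derive the cap from what your traffic actually generates rather than from the model's ceiling. The percentile and headroom factor below are arbitrary placeholders, not recommendations.

```python
# Derive a per-task max_tokens cap from observed output lengths instead of the model ceiling.
# The percentile and headroom factor are illustrative starting points.
import math

def suggest_max_tokens(observed_output_tokens: list[int],
                       percentile: float = 0.99,
                       headroom: float = 1.2) -> int:
    """Cap at roughly the p99 of historical output length plus a safety margin."""
    ranked = sorted(observed_output_tokens)
    idx = min(len(ranked) - 1, math.ceil(percentile * len(ranked)) - 1)
    return int(ranked[idx] * headroom)

# Example: a summarization endpoint that historically emits 150-400 tokens
history = [180, 220, 350, 160, 410, 240, 300, 190, 260, 380]
print(suggest_max_tokens(history))  # a few hundred, not 4096
```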

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
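In code, the dispatch layer can start as small as one function in the request path. The model names and the signal heuristics below are placeholders for whatever your evals actually justify.

```python
# Minimal runtime dispatch sketch: pick a model per request, not per deployment.
# Tier names and heuristics are illustrative placeholders.

CHEAP_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

def route(query: str) -> str:
    """Return the model to dispatch this request to, based on cheap per-request signals."""
    signals = {
        "length": len(query),
        "has_code": "```" in query or "Traceback" in query,
        "asks_reasoning": any(w in query.lower() for w in ("why", "debug", "prove")),
    }
    if signals["has_code"] or signals["asks_reasoning"] or signals["length"] > 2_000:
        return FRONTIER_MODEL
    return CHEAP_MODEL

# The router lives in the request path, so every call re-evaluates the choice:
#   model = route(user_query)
#   response = client.chat(model=model, ...)
```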

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.
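As a preview of the shape of that fix, here is a minimal sketch of a retry budget denominated in tokens rather than request counts. The ratio and the accounting are illustrative assumptions, not the post's exact recommendation.

```python
# Sketch: a retry budget denominated in tokens, not request counts.
# Retries are allowed only while retry spend stays under a fixed fraction of
# first-attempt spend. Names and numbers are illustrative.

class TokenRetryBudget:
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio            # retries may spend at most 10% of first-attempt tokens
        self.first_attempt_tokens = 0
        self.retry_tokens = 0

    def record_first_attempt(self, tokens: int) -> None:
        self.first_attempt_tokens += tokens

    def can_retry(self, estimated_tokens: int) -> bool:
        # In an agent loop a retry replays the full prompt history, so estimate that
        # cost up front and refuse once the budget is exhausted.
        return self.retry_tokens + estimated_tokens <= self.ratio * self.first_attempt_tokens

    def record_retry(self, tokens: int) -> None:
        self.retry_tokens += tokens
```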

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 8 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost $60 per million tokens. By early 2026, you could buy equivalent performance for $0.06. That is a 1,000x reduction in five years. During the same period, enterprise AI spending grew 320% — from $11.5 billion to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.