4 posts tagged with "cost-management"

The Provider Auto-Router That Quietly Routed Your Premium Traffic To Haiku

· 10 min read
Tian Pan
Software Engineer

Your platform team adopted the provider's "auto" model identifier for cost reasons. The first dashboard after rollout was hard to argue with: a 34% spend reduction with no measurable quality drop on the weekly eval. Three quarters later, customer satisfaction on your shortest, highest-volume surface had been sliding for two quarters, and a product-led investigation eventually traced the regression to a model identifier nobody on the engineering team had touched. The code said "auto." The provider had been redefining what "auto" meant the whole time.

The lesson is not that auto-routing is bad. The lesson is that "auto" is a moving target whose distribution drifts with provider economics, and your eval's representativeness is the only check standing between vendor optimization and your product quality. If the eval does not match the traffic, the discount you celebrated is being paid out of a quality slope nobody is reviewing.
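One way to make that check concrete is to compare which concrete models served production traffic against which ones served the eval. The sketch below is illustrative only: it assumes your provider's response reports the resolved model for each call and that you log it under a hypothetical `resolved_model` field. The model names and counts are made up.

```python
from collections import Counter


def model_share(records):
    """Fraction of calls served by each resolved model."""
    counts = Counter(r["resolved_model"] for r in records)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share distributions.

    0.0 means the two mixes are identical, 1.0 means they share no models.
    """
    models = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in models)


# Hypothetical logs: "small" and "large" stand in for whatever models
# the "auto" identifier resolved to on each call.
production = [{"resolved_model": "small"}] * 70 + [{"resolved_model": "large"}] * 30
eval_runs = [{"resolved_model": "small"}] * 20 + [{"resolved_model": "large"}] * 80

drift = total_variation(model_share(production), model_share(eval_runs))
print(f"routing-mix drift between eval and production: {drift:.2f}")
```

A large distance here says the eval is mostly grading a model your users rarely see. Tracking it per surface, week over week, is what turns a silent redefinition of "auto" into a visible change.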

The Token Count Your Client Estimated And Your Provider Invoiced

· 12 min read
Tian Pan
Software Engineer

Your application counted tokens locally with a tokenizer library matching what you believed the provider used. The SDK reported "estimated 4,200 tokens" before each call. Your budget logic admitted the request. Then the provider's invoice came back at 6,800 tokens for the same payload. Multiply that 60% gap by a few million calls a month and the line item your finance team cannot reconcile against your own logs starts to look like an architectural mistake rather than a rounding error.

The mistake is not that the local tokenizer was wrong. The mistake is treating the local tokenizer as a contract instead of a guess. Tokenization is something the provider does inside their serving stack — your library is a model of that process, not the process itself, and the two drift in ways that are small per call and structural across the population of calls you actually make.
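Treating the estimate as a guess can be sketched as measuring how far billed counts run above local estimates and padding the budget check by that margin. Everything here is an assumption for illustration: the `(estimated, billed)` pairs stand in for your own reconciliation logs, and the 95th-percentile choice is arbitrary.

```python
import math


def billed_ratio_quantile(pairs, q=0.95):
    """Return the q-th quantile of billed/estimated ratios (nearest-rank method).

    `pairs` is an iterable of (estimated_tokens, billed_tokens) taken from
    calls where both numbers were logged.
    """
    ratios = sorted(billed / est for est, billed in pairs if est > 0)
    if not ratios:
        raise ValueError("no usable (estimated, billed) pairs")
    rank = math.ceil(q * len(ratios))  # 1-based nearest rank
    return ratios[max(rank, 1) - 1]


def admit(estimated_tokens, budget_tokens, multiplier):
    """Admit a request only if the padded estimate fits within the budget."""
    return estimated_tokens * multiplier <= budget_tokens


# Hypothetical reconciliation log; the first pair mirrors the example above.
history = [(4200, 6800), (1000, 1100), (2000, 2100), (500, 520)]
multiplier = billed_ratio_quantile(history, q=0.95)

print(admit(4200, 5000, 1.0))         # the raw estimate fits the budget
print(admit(4200, 5000, multiplier))  # the padded estimate does not
```

The multiplier needs recomputing as traffic changes, since the gap between estimate and invoice depends on the mix of payloads you actually send.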

Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices

· 10 min read
Tian Pan
Software Engineer

Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the eval actually scoring against?

This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.

The Off-Hours Cost Curve: Why Your AI Feature Spends Differently on Saturday Than on Tuesday

· 10 min read
Tian Pan
Software Engineer

The cost dashboard everyone looks at is a weekly rolling average, and that average is lying to you. Not in the sense that the number is wrong — it's a faithful arithmetic mean of a billing event stream — but in the sense that it is hiding the shape of the cost curve underneath. The hours between Friday evening and Monday morning consume tokens differently from the hours between Tuesday at 10am and Thursday at 4pm. The cohort active on Saturday at 3am is not the cohort active on Tuesday at 11am, and the per-user economics of those cohorts diverge by a factor that nobody writes down because the dashboard averaged it away.

Most teams discover this the first time a weekend automation script melts the budget. A LangChain agent gets into an infinite conversation cycle Friday night, runs for the better part of a weekend before anyone notices, and produces a five-figure invoice that has to be explained to finance on Monday morning. The post-incident review treats it as a one-off: bad retry logic, a missing budget cap, no page to on-call. But the same dashboard that hid the runaway loop is also hiding the steady-state version of the same phenomenon: a baseline of off-hours traffic whose unit economics are structurally worse than the business-hours baseline, every single week, and which the weekly average smooths into invisibility.
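To see the shape the weekly average hides, one option is to bucket billing events by hour of week and divide by the users active in each bucket. This is a minimal sketch under assumptions: the `(timestamp, user_id, cost_usd)` event shape is hypothetical, the sample values are invented, and timestamps are taken to already be in the timezone you care about.

```python
from collections import defaultdict
from datetime import datetime


def cost_per_user_by_hour_of_week(events):
    """Map (weekday, hour) -> average cost per distinct active user.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    cost = defaultdict(float)
    users = defaultdict(set)
    for ts, user_id, cost_usd in events:
        bucket = (ts.weekday(), ts.hour)
        cost[bucket] += cost_usd
        users[bucket].add(user_id)
    return {bucket: cost[bucket] / len(users[bucket]) for bucket in cost}


# Hypothetical billing events: two people on Tuesday at 11am,
# one automation account on Saturday at 3am.
events = [
    (datetime(2024, 6, 4, 11, 5), "u1", 0.02),
    (datetime(2024, 6, 4, 11, 40), "u2", 0.04),
    (datetime(2024, 6, 8, 3, 15), "bot", 1.50),
    (datetime(2024, 6, 8, 3, 50), "bot", 1.50),
]

curve = cost_per_user_by_hour_of_week(events)
for (weekday, hour), usd in sorted(curve.items()):
    print(f"weekday={weekday} hour={hour:02d} cost/user=${usd:.2f}")
```

In this toy data the Saturday 3am bucket costs a hundred times more per user than Tuesday 11am, a ratio that a single weekly mean would flatten into one number.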