Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices
Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the model actually scoring against?
This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.
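The snapshot-fixture failure mode is easy to see in miniature. Below is a minimal sketch (all strings and function names are hypothetical, invented for illustration) of why a suite anchored to the old model's wording goes red on a semantically equivalent answer: the exact-match check fails even though a similarity measure shows the new output is a rewording, not a regression.

```python
from difflib import SequenceMatcher

# Hypothetical fixture captured from the previous model's output (illustrative only).
OLD_SNAPSHOT = "The refund was issued on March 3 and should appear within 5 business days."
# The new model's answer to the same prompt: equivalent meaning, different wording.
NEW_OUTPUT = "Your refund went out on March 3; expect it within five business days."

def exact_match(expected: str, actual: str) -> bool:
    """The brittle check many snapshot suites start with."""
    return expected == actual

def wording_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a rewording scores well below 1.0
    but well above 0, which is the signal exact-match throws away."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

passed = exact_match(OLD_SNAPSHOT, NEW_OUTPUT)      # False: the suite goes red
sim = wording_similarity(OLD_SNAPSHOT, NEW_OUTPUT)  # partial overlap: a rewording
print(f"exact match: {passed}, similarity: {sim:.2f}")
```

A similarity score alone does not re-anchor the suite, of course; it only separates "the model changed its wording" from "the model changed its answer," which is the triage question the red snapshot diffs in the standup were really asking.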
