861 posts tagged with "insider"

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.
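To make the concentration claim concrete, here is a minimal sketch of the "sort by customer_id" view. The row format `(customer_id, cost_usd)`, the function name `top_share`, and the sample data are all assumptions for illustration; a real version would read from whatever usage log the billing pipeline already produces.

```python
from collections import defaultdict

def top_share(usage_rows, top_fraction):
    """Fraction of total spend driven by the top `top_fraction` of customers.

    `usage_rows` is an iterable of (customer_id, cost_usd) pairs; this shape
    is an assumption for the sketch, not a real billing schema.
    """
    spend = defaultdict(float)
    for customer_id, cost_usd in usage_rows:
        spend[customer_id] += cost_usd
    ranked = sorted(spend.values(), reverse=True)
    # Always include at least one customer, even for tiny populations.
    k = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Synthetic data: one customer costing $4,000 next to 99 costing $40 each.
rows = [("heavy", 4000.0)] + [(f"c{i}", 40.0) for i in range(99)]
share = top_share(rows, 0.01)  # share of spend from the top 1% of customers
```

Running the same function at 0.001 and 0.0001 on real data is a cheap way to check whether the tail keeps the same shape further out.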

Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.
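A minimal sketch of what "prompt as a build artifact" could look like, assuming a hypothetical `TenantConfig` record and a small table of per-industry fragments. None of these names come from a real library, and the fragment text is invented; the point is only that the output is derived from source plus configuration, and that it gets an identifier you can log.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical source fragments; a real system would keep these under review.
BASE = "You are a support assistant."
FRAGMENTS = {
    "healthcare": "Frame responses with HIPAA-aware language.",
    "legal": "Cite a source for every factual claim.",
}

@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    industry: Optional[str] = None
    custom_rubric: Optional[str] = None  # e.g. negotiated in a master agreement

def compile_prompt(cfg):
    """Build the prompt text for one tenant and a short content hash for it."""
    parts = [BASE]
    if cfg.industry in FRAGMENTS:
        parts.append(FRAGMENTS[cfg.industry])
    if cfg.custom_rubric:
        parts.append(cfg.custom_rubric)
    text = "\n\n".join(parts)
    artifact_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, artifact_id

free_text, free_id = compile_prompt(TenantConfig("tenant-d"))
health_text, health_id = compile_prompt(TenantConfig("tenant-a", industry="healthcare"))
```

Because the artifact ID is a hash of the compiled text, two tenants with identical configuration share an artifact, and any change to a fragment shows up as a new ID for every tenant that includes it.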

Prompt Portfolios: Manage a Basket, Not a Single Best Prompt

· 10 min read
Tian Pan
Software Engineer

Most production AI teams talk about prompts the way junior traders talk about stocks: there is one best one, and the job is to find it. So they iterate — a Slack thread, a few eval rows, a new winner, push to main, repeat. The result is a single artifact carrying the entire intent-resolution surface of the product, optimized against a frozen evaluation set, sitting one regrettable edit away from a P1.

The mistake is the singular. A prompt is not a security; it is an allocation. The same user intent can be served well by several variants, each with its own confidence interval, its own per-segment performance, and its own sensitivity to model and corpus drift. The right mental model is not "find the best prompt" — it is "manage a basket of prompts whose composition is itself the product." Quantitative finance figured this out fifty years ago, and the operational machinery transfers almost without modification.

Prompts Don't Roll Back Like Code: Why git revert Is the Wrong Primitive

· 9 min read
Tian Pan
Software Engineer

A senior engineer ships a prompt change behind a 10% canary. By the next morning, the canary cohort's helpfulness score has dropped four points, the on-call notices, and the team does what every team does — they revert the commit and redeploy. The dashboard does not recover. It does not recover the next day either. Three days later, a postmortem reveals that the cohort that saw the bad prompt is still seeing degraded outputs because their conversation histories now contain assistant turns produced by the rolled-back prompt, and the model is conditioning on those turns. The commit is gone. The damage is not.

This is the part of LLMOps that the "treat prompts like code" advice quietly skips. Code rollback is a text replacement that restores a deterministic past state. Prompt rollback has to reconcile with a tail of side effects — caches, histories, eval baselines, experiment cohorts, downstream contracts — that the bad prompt has already imprinted on the production world. git revert flips the text. It does not flip the consequences.
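One way to make the history side effect addressable is to stamp every assistant turn with the prompt version that produced it. The sketch below assumes a hypothetical `Turn` record with a `prompt_version` field; what to do with the affected turns (drop, summarize, regenerate) is a product decision the sketch does not make for you, and dropping is used here only because it is the simplest to show.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    role: str                              # "user" or "assistant"
    content: str
    prompt_version: Optional[str] = None   # set on assistant turns only

def quarantine(history, bad_version):
    """Split out assistant turns produced under `bad_version`.

    Returns (kept_turns, dropped_count). User turns are always kept.
    """
    kept, dropped = [], 0
    for turn in history:
        if turn.role == "assistant" and turn.prompt_version == bad_version:
            dropped += 1
            continue
        kept.append(turn)
    return kept, dropped

history = [
    Turn("user", "How do I export my data?"),
    Turn("assistant", "Use the export page.", prompt_version="v41"),
    Turn("user", "And in CSV?"),
    Turn("assistant", "Unclear answer.", prompt_version="v42"),
]
kept, dropped = quarantine(history, bad_version="v42")
```

Without the version stamp, the rollback has no way to find the turns the bad prompt wrote, which is the situation the postmortem above describes.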

Quantization Slippage: The Capability Tax Your Eval Set Was Never Built to Catch

· 11 min read
Tian Pan
Software Engineer

A self-hosted LLM team quantizes the production model from fp16 to int4. Memory drops 4×, throughput nearly doubles, the GPU bill shrinks, and the team reruns the same eval suite that gated the fp16 release. MMLU-Pro retains 98.1% of baseline. Aggregate quality looks fine. They ship.

Six weeks later, a support engineer notices the math tutoring feature has gotten quietly worse. The compliance team flags an uptick in policy-violation completions on adversarial prompts. The structured-output retry rate has crept from 1.4% to 6.8%. None of these show up on the eval dashboard, because the eval dashboard was built to validate a different model: the one derived from the same checkpoint, but with four times as many bits behind every weight.

This is quantization slippage. The cost analysis priced the memory win and the latency win. It did not price the eval re-anchoring that the swap silently demanded, and the eval suite, calibrated against the fp16 distribution, is now grading the wrong model with the wrong rubric.
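A rough sketch of the re-anchoring check, comparing retention per capability slice instead of in aggregate. The slice names, the 0.97 floor, and most of the scores are invented for illustration; only the structured-output pair is derived from the retry rates quoted above (1.4% and 6.8% retries, so 98.6% and 93.2% first-try success).

```python
def failing_slices(baseline, quantized, floor=0.97):
    """Return slices whose quantized/baseline retention falls below `floor`."""
    failing = {}
    for name, base_score in baseline.items():
        retention = quantized[name] / base_score
        if retention < floor:
            failing[name] = round(retention, 3)
    return failing

# Hypothetical per-slice scores (higher is better).
baseline = {"mmlu_pro": 0.700, "math_tutoring": 0.820, "structured_output": 0.986}
quantized = {"mmlu_pro": 0.687, "math_tutoring": 0.740, "structured_output": 0.932}

flagged = failing_slices(baseline, quantized)
```

In this toy data the headline benchmark clears the floor while two product-facing slices do not, which is the shape of failure an aggregate retention number is built to hide.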

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

· 10 min read
Tian Pan
Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
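The per-resolved-request arithmetic can be sketched in a few lines. The call prices and call counts come from the example above; the resolve rates and the $5.00 escalation cost are assumptions picked only to show the shape of the comparison, so substitute your own measurements.

```python
def cost_per_resolved_request(call_cost, calls_per_request,
                              model_resolve_rate, escalation_cost):
    """Expected all-in cost of one request that ends up resolved.

    Assumes every request the model fails to resolve is escalated to a
    human at a flat `escalation_cost`, which always resolves it.
    """
    model_spend = call_cost * calls_per_request
    escalation_spend = (1.0 - model_resolve_rate) * escalation_cost
    return model_spend + escalation_spend

# Hard-request slice only. Resolve rates and escalation cost are assumed.
workhorse = cost_per_resolved_request(0.05, 5, model_resolve_rate=0.60,
                                      escalation_cost=5.00)  # about $2.25
reasoning = cost_per_resolved_request(0.50, 1, model_resolve_rate=0.90,
                                      escalation_cost=5.00)  # about $1.00
```

On tokens alone the workhorse still wins ($0.25 against $0.50). The ordering flips only once the escalation term is in the formula, which is why the per-call dashboard never shows it.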

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.
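For contrast, here is a small sketch of what a deliberate variance probe might look like: the hypothesis and the run count are fixed before the first reroll, and every run is counted. `run_once` is a hypothetical callable that returns True when the output passes whatever check you have written down; nothing here is tied to a particular eval framework.

```python
def variance_probe(run_once, n_runs, hypothesis):
    """Run a fixed, pre-declared number of trials and report all of them."""
    if n_runs < 1:
        raise ValueError("decide on a positive number of runs in advance")
    failures = sum(1 for _ in range(n_runs) if not run_once())
    return {
        "hypothesis": hypothesis,
        "runs": n_runs,
        "failures": failures,
        "failure_rate": failures / n_runs,
    }

# Stand-in for a real model call: a scripted sequence of pass/fail results.
scripted = iter([True, False, True, True, False, True, True, True, True, True])
report = variance_probe(lambda: next(scripted), n_runs=10,
                        hypothesis="failure only occurs when the tool response is stale")
```

The difference from slot-machine debugging is that a passing second run does not end the experiment; the report still says two of ten failed.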

The Self-Critique Tax: When Asking the Model to Check Its Own Work Costs Double for Modest Wins

· 11 min read
Tian Pan
Software Engineer

A team ships a self-critique loop into production because the benchmark numbers looked irresistible: Self-Refine reported a 20 percent absolute improvement averaged across seven tasks, Chain-of-Verification cut hallucinations by 50 to 70 percent on QA workloads, and reflection prompts pushed math-equation accuracy up 34.7 percent in one widely-cited paper. A month later the finance review surfaces the bill. The product's per-request cost has roughly tripled, p99 latency is up by a factor of three, and the actual quality lift that survived contact with production traffic is closer to three percent than thirty. The self-critique loop is doing exactly what it advertised. The team just never priced it.

This is the self-critique tax: a reliability pattern that reads like a free quality win on a slide and reads like a structural cost increase on an invoice. The pattern itself is sound — there are real cases where generate-then-verify is the right answer. The failure mode is shipping it as a default instead of as a calibrated intervention, and discovering at the wrong time of the quarter that "the model checks its own work" was actually a procurement decision.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.
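A back-of-the-envelope model shows where the superlinear growth comes from. It assumes the full history is resent on every turn, a fixed prefix (system prompt and tool definitions) of 2,000 tokens, and 200 new tokens of history per turn; all three numbers are illustrative. It counts input tokens only and ignores caching, output tokens, and long-context pricing tiers.

```python
def conversation_input_tokens(turns, prefix_tokens, tokens_per_turn):
    """Total input tokens billed across a conversation that resends history.

    Turn k sends the fixed prefix plus k turns' worth of conversation
    (k - 1 turns of history and the new message).
    """
    total = 0
    for k in range(1, turns + 1):
        total += prefix_tokens + k * tokens_per_turn
    return total

single_turn = conversation_input_tokens(1, 2000, 200)
thirty_turns = conversation_input_tokens(30, 2000, 200)
multiple = thirty_turns / single_turn  # well above 30 under these assumptions
```

The history term grows as turns × (turns + 1) / 2, so doubling the conversation length roughly quadruples that part of the bill. Caching and truncation change the constant, not the shape.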

Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

· 11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

Tenancy Leaks Through Few-Shot Examples: When Your Prompt Library Becomes a Cross-Customer Data Store

· 11 min read
Tian Pan
Software Engineer

Open the production system prompt of a maturing AI product, scroll past the role description, and you will almost always find a section labeled # Examples or ## Few-shot demonstrations. The examples are excellent — they are concrete, they are domain-specific, they pattern-match exactly the failure modes the eval set was struggling with last quarter. They are also, on closer inspection, real customer data. A real ticket ID from a real account. A phrasing pattern lifted verbatim from a support thread. An internal product code that one tenant uses and the rest of the customer base has never heard of.

The team that put them there is not careless. The examples got into the prompt the way good examples always get into prompts: someone mined production traces for cases the model handled poorly, picked the cleanest worked example, pasted it into the system message, watched the eval scores climb, and shipped. That pipeline — production trace to system prompt — is the most reliable prompt-improvement loop in modern LLM engineering. It is also a structural cross-tenant data leak that the team built without noticing, and the system prompt has quietly become a multi-tenant data store the data-processing agreement never priced.
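A first-pass audit can be as simple as scanning the few-shot examples for identifier-shaped strings and known tenant-specific terms. The ticket-ID pattern and the term list below are hypothetical; a real scan would use the ID formats and product codes your own systems actually emit, and a clean result from a scan like this is not proof of absence.

```python
import re

# Hypothetical ticket-ID shape, e.g. "SUP-48213". Adjust to your own formats.
TICKET_ID = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")

def flag_examples(examples, tenant_terms):
    """Return (index, hits) for each example that looks tenant-specific."""
    flagged = []
    for i, text in enumerate(examples):
        hits = TICKET_ID.findall(text)
        lowered = text.lower()
        hits += [term for term in tenant_terms if term.lower() in lowered]
        if hits:
            flagged.append((i, hits))
    return flagged

examples = [
    "Customer asks about a refund for SUP-48213.",
    "User asks how to reset a password.",
]
flagged = flag_examples(examples, tenant_terms=["ZX-Widget"])
```

Wiring a check like this into the step that promotes a production trace into the prompt library catches the leak at the point where it is introduced.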

Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

· 10 min read
Tian Pan
Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.
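Two small pieces make the seam visible. The first is the naive product of per-tool rates, which predicts about 91% for this pair and so leaves the gap down to 78% unexplained. The second is a seam check that validates tool A's output against tool B's expectations before the call is made. The field names `date` and `price_unit` are assumptions for illustration, not the real schema of either tool.

```python
from datetime import date

def naive_pipeline_rate(*tool_rates):
    """Success rate if failures were independent and seams were perfect."""
    rate = 1.0
    for r in tool_rates:
        rate *= r
    return rate

def seam_violations(listing):
    """Check a search_listings result against what book_appointment expects."""
    problems = []
    try:
        date.fromisoformat(listing.get("date", ""))
    except ValueError:
        problems.append("date is not ISO 8601 (YYYY-MM-DD)")
    if listing.get("price_unit") != "dollars":
        problems.append("price is not in dollars")
    return problems

predicted = naive_pipeline_rate(0.96, 0.95)
bad = seam_violations({"date": "05/01/2025", "price_unit": "cents"})
ok = seam_violations({"date": "2025-05-01", "price_unit": "dollars"})
```

Logging the output of a check like `seam_violations` gives the seam its own dashboard panel, so the gap between the predicted rate and the observed one has somewhere to show up.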