Skip to main content

233 posts tagged with "observability"

View all tags

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

· 11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

Token Accounting Drift: When Your Trace Logs Don't Match the Provider Invoice

· 9 min read
Tian Pan
Software Engineer

There is a finance meeting that happens at every company shipping a hosted LLM feature, usually around month four. The engineering team has been logging token counts from every request. The finance team has the provider's invoice. The numbers don't agree. Sometimes the gap is five percent. Sometimes it is thirty. The engineers say the invoice is wrong. The finance team says the logs are wrong. Both teams are technically correct, and neither owns the reconciliation.

The drift is not fraud. It is a structural measurement problem, and the structure has at least six independent failure modes that compound. A team that does not own those failure modes will spend the next quarter writing apology emails to FP&A about why the forecast slipped, when the real story is that nobody on the engineering side ever audited what "token" meant in their own logs.

Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

· 10 min read
Tian Pan
Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is in the SLO. Your product is broken anyway — the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.

Escalation Rate Is the Eval Signal Your Offline Tests Missed

· 10 min read
Tian Pan
Software Engineer

Every agent feature has a back door. Some teams call it "escalate to support." Some call it "route to a human reviewer." Some call it the templated "I'm not able to help with that — let me connect you to someone who can." Whatever the label, every production agent has a path that gives up on the user's request and hands it to a human, and the rate at which production traffic takes that path is one of the few signals that doesn't depend on labelers, judges, or a hand-built test set. It is the system telling you, in production, that the model could not handle a request the user actually sent.

That signal is almost always being read by the wrong team. Escalation rate is a workforce-planning metric in most companies: it determines how many human agents the queue needs next quarter, and it lives on a dashboard the operations team reviews on a different cadence than the AI team reads its eval scores. A 30% week-over-week escalation increase shows up as a staffing question in a Monday operations review, while the AI team's eval suite stays green and the leadership readout says the feature is healthy. Both teams are looking at the same production system and arriving at opposite conclusions: ops thinks they need more headcount, AI thinks the model is fine.

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the platform passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Eval Selection Bias: Why Your Test Set Goes Blind to the Failures That Drove Users Away

· 10 min read
Tian Pan
Software Engineer

There is a quiet failure mode in production-grade LLM evaluation that no leaderboard catches: your test set is built from the users who stayed, so it never asks the questions that made the others leave. Quarter over quarter the eval scores climb, the dashboards turn green, and net retention sags anyway. The team chases "is the eval gameable?" when the real story is simpler and harder. The eval distribution drifted toward survivors, and survivors are exactly the population whose feedback you least need.

This is the WWII bomber armor problem in a new costume. Abraham Wald looked at returning planes, noticed where the bullet holes clustered, and pointed out that the holes you should reinforce against are the ones on planes that didn't come back. Replace bombers with users, replace bullet holes with failed turns, and you have the central pathology of eval sets seeded from production traces.

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

· 11 min read
Tian Pan
Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.

The Five Definitions of 'Now' Inside Your LLM Prompt

· 11 min read
Tian Pan
Software Engineer

A customer support agent told a user "based on our latest pricing, as of today" and quoted last quarter's price sheet. The system prompt interpolated today is {current_date} correctly. The retrieval layer pulled the document with the highest freshness score. The model answered confidently. Every component did exactly what it was specified to do, and the user got a wrong answer that the on-call engineer could not reproduce because, by the time they replayed the trace at 9pm, "today" was a different day.

This is not a rare bug. It is a failure mode that lives in almost every production LLM pipeline because "now" is implicit in the prompt at five different layers, and those layers were authored at different times, by different people, against different definitions of the present. As long as a request runs synchronously from a foreground user session, the layers mostly agree. The moment the request is replayed for debugging, batch-processed overnight, run from an eval harness pinned in March, or queued and consumed an hour later, the layers start disagreeing — and the model produces an answer that is internally consistent within its prompt but externally wrong.