Skip to main content

299 posts tagged with "observability"

View all tags

Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

· 9 min read
Tian Pan
Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong — it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

· 12 min read
Tian Pan
Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.

AI Feature Dependency Graphs: When a Prompt Edit Is a Silent Breaking Change

· 12 min read
Tian Pan
Software Engineer

A team owns a summarizer. Another team owns the search ranker that ingests those summaries. A third team owns a router that picks between agent personalities based on the ranker's confidence score. None of these teams have a shared on-call rotation, none of them sit in the same standup, and the only contract between them is "the previous feature's output is the next feature's input." On a Tuesday, the summarizer team tightens a prompt to fix a hallucination complaint from a sales demo. The search ranker's quality collapses six hours later. The router starts handing off to the wrong agent personality by Wednesday morning. The post-mortem will record the cause as "prompt change," but the actual cause is that the team's AI features have quietly composed into a directed graph that nobody drew.

This is the most common shape of an AI outage that doesn't trip any of the alerts you built for AI outages. The model isn't down. The eval suite for the changed feature is green. The token cost line is flat. What broke is the interface between two features, which is a thing your dependency tooling treats as plain text because that's all it is at the API boundary — and treats as inert because plain text doesn't carry a version, a schema, or a deprecation policy.

Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship

· 10 min read
Tian Pan
Software Engineer

The eval set that scored 92% last quarter is now scoring 94%, and the team is calling that progress. It isn't. The labels in that eval set were written against a rubric the annotators no longer hold in their heads. The product the model is being graded on has moved. The standards have moved. The annotators' own calibration has moved. What looks like a two-point improvement is the silent gap between a frozen artifact and a living product, and that gap widens every week the team doesn't refresh.

Annotation drift is the quiet failure mode of mature LLM eval programs. It doesn't show up as a regression — regressions are the easy case, because the number goes down and somebody investigates. It shows up as a number that stays green while the thing it's supposed to measure decays underneath it. Teams that have already built an eval set, written a rubric, and recruited annotators are the most exposed, because they trust the system they built and stop auditing the foundation.

Eval Triage Queues: Why FIFO Misses the Failures That Matter

· 11 min read
Tian Pan
Software Engineer

A healthy eval set is supposed to be a sign of maturity. It is also, on any given Monday, a thousand failed cases sitting in a queue with a human reviewer who has eight hours and a per-case throughput of about fifty. The arithmetic is brutal: roughly one in twenty failures gets read. The other nineteen wait. Which nineteen wait, and which one gets the seat, is decided by whichever order the file happens to load in.

Most teams call this "reviewing failures." It is closer to a lottery weighted by alphabetical order. A failure case that affects two percent of production traffic and lives at the top of the file gets attention. A failure case that affects forty percent of production traffic and lives near the bottom gets a glance on Friday afternoon, if at all. The team ships a fix for the small problem on Tuesday and writes a retro on Thursday wondering why the dashboard hasn't moved.

The Off-Hours Cost Curve: Why Your AI Feature Spends Differently on Saturday Than on Tuesday

· 10 min read
Tian Pan
Software Engineer

The cost dashboard everyone looks at is a weekly rolling average, and that average is lying to you. Not in the sense that the number is wrong — it's a faithful arithmetic mean of a billing event stream — but in the sense that it is hiding the shape of the cost curve underneath. The hours between Friday evening and Monday morning consume tokens differently from the hours between Tuesday at 10am and Thursday at 4pm. The cohort active on Saturday at 3am is not the cohort active on Tuesday at 11am, and the per-user economics of those cohorts diverge by a factor that nobody writes down because the dashboard averaged it away.

Most teams discover this the first time a weekend automation script melts the budget. A LangChain agent gets into an infinite conversation cycle Friday night, runs for the better part of a week before anyone notices, and produces a five-figure invoice that has to be explained to finance on Monday morning. The post-incident review treats it as a one-off — bad retry logic, missing budget cap, didn't page on-call. But the same dashboard that hid the runaway loop is also hiding the steady-state version of the same phenomenon: a baseline of off-hours traffic whose unit economics are structurally worse than the business-hours baseline, every single week, and which the weekly average smooths into invisibility.

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

· 11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

Token Accounting Drift: When Your Trace Logs Don't Match the Provider Invoice

· 9 min read
Tian Pan
Software Engineer

There is a finance meeting that happens at every company shipping a hosted LLM feature, usually around month four. The engineering team has been logging token counts from every request. The finance team has the provider's invoice. The numbers don't agree. Sometimes the gap is five percent. Sometimes it is thirty. The engineers say the invoice is wrong. The finance team says the logs are wrong. Both teams are technically correct, and neither owns the reconciliation.

The drift is not fraud. It is a structural measurement problem, and the structure has at least six independent failure modes that compound. A team that does not own those failure modes will spend the next quarter writing apology emails to FP&A about why the forecast slipped, when the real story is that nobody on the engineering side ever audited what "token" meant in their own logs.

Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

· 10 min read
Tian Pan
Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.