118 posts tagged with "llm-ops"

The Support Ticket to Eval Case Pipeline Nobody Builds

· 10 min read
Tian Pan
Software Engineer

Every team running an AI feature in production is sitting on the highest-signal eval dataset they will ever have, and they are not using it. The dataset is in Zendesk. Or Intercom. Or Freshdesk, or Help Scout, or whatever queue the support team lives inside. The tickets that get filed there describe the exact failure modes the model produced in front of a paying customer — wrong tone, wrong tool call, wrong policy, hallucinated capability, leaked context. Each one is a labeled negative example, hand-written by the user who experienced the failure, often with reproduction steps and a sentiment annotation attached for free.

The eval suite, meanwhile, lives in Git. It was hand-written by whichever engineer set it up six months ago, and it has accumulated maybe fifty cases since. The intersection between "things the eval suite covers" and "things that actually break in production" is a Venn diagram with a thin sliver of overlap and two large, mutually ignorant lobes.

Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

· 9 min read
Tian Pan
Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 latency is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong: it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.

Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship

· 10 min read
Tian Pan
Software Engineer

The eval set that scored 92% last quarter is now scoring 94%, and the team is calling that progress. It isn't. The labels in that eval set were written against a rubric the annotators no longer hold in their heads. The product the model is being graded on has moved. The standards have moved. The annotators' own calibration has moved. What looks like a two-point improvement is the silent gap between a frozen artifact and a living product, and that gap widens every week the team doesn't refresh.

Annotation drift is the quiet failure mode of mature LLM eval programs. It doesn't show up as a regression — regressions are the easy case, because the number goes down and somebody investigates. It shows up as a number that stays green while the thing it's supposed to measure decays underneath it. Teams that have already built an eval set, written a rubric, and recruited annotators are the most exposed, because they trust the system they built and stop auditing the foundation.

Asymmetric Eval Economics: Why One Eval Case Costs More Than the Feature It Tests

· 9 min read
Tian Pan
Software Engineer

Here is the awkward truth most AI teams discover six months too late: a single well-designed eval case routinely costs more engineering effort than the feature it is supposed to test. A prompt edit takes an afternoon. The eval case that gives you confidence the prompt edit didn't break something takes a domain expert two days of labeling, a calibration loop with a judge prompt, and a discussion about what "correct" even means for this user surface. The feature ships in a sprint. The eval that lets you ship the next ten features safely takes a quarter to mature.

The asymmetry isn't a bug. It is the structural shape of evaluation work. Labeling, edge-case curation, judge calibration, and rubric design are upfront fixed costs that don't scale with how many features you ship — they scale with how many distinct behaviors you want to verify. Meanwhile the feature side keeps producing what feels like cheap marginal output: "another prompt iteration," "one more tool added to the agent," "swap the model." Each looks individually small. Each silently increases the surface area the eval set must cover.

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.
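
The concentration check described above can be sketched in a few lines. This is a minimal illustration, not the post's own tooling: the `(customer_id, cost)` row shape and the function names `aggregate` and `top_share` are assumptions made for the example.

```python
from collections import defaultdict


def aggregate(usage_rows):
    """Sum cost per customer from (customer_id, cost) rows (assumed row shape)."""
    spend = defaultdict(float)
    for customer_id, cost in usage_rows:
        spend[customer_id] += cost
    return dict(spend)


def top_share(spend_by_customer, fraction):
    """Share of total spend driven by the top `fraction` of customers."""
    totals = sorted(spend_by_customer.values(), reverse=True)
    if not totals:
        return 0.0
    # Always include at least one customer, even for tiny fractions.
    k = max(1, int(len(totals) * fraction))
    total = sum(totals)
    return sum(totals[:k]) / total if total else 0.0
```

Running `top_share` at 0.01, 0.001 and 0.0001 against the same per-customer totals is one way to see whether the shape repeats further up the tail, which a per-feature or per-model breakdown cannot show.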

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.
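
If a rerun is meant as a variance probe, the run count can be fixed before the first reroll. The sketch below assumes independent runs and a failure that fires with a fixed probability on this exact input. That is a simplifying assumption, and it does not hold for a bug that only triggers on specific inputs. The function names are hypothetical.

```python
import math


def miss_probability(failure_rate, runs):
    """Chance that `runs` independent reruns all look fine despite the bug."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence=0.95):
    """Reruns required to see the failure at least once with `confidence`."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

Under these assumptions a failure that fires 10% of the time survives a single clean rerun 90% of the time, and it takes 29 runs to reach 95% confidence of seeing it at least once.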

Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

· 11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is within SLO. Your product is broken anyway: the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.
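
One way to put a number on the capability side is to score the responses that already count as "available". The sketch below is an assumption-heavy illustration: the `(status, body)` pair shape, the refusal marker list, and the rule that a usable answer must parse as JSON are all hypothetical choices for this example, not a vendor API.

```python
import json

# Hypothetical refusal phrases; a real list would come from observed traffic.
REFUSAL_MARKERS = ("i can't help with that", "i cannot help with that")


def is_usable(body):
    """A response is usable if it is not a refusal and parses as JSON."""
    text = body.strip().lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return False
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def capability_rate(responses):
    """Fraction of HTTP 200 responses that the product could actually use."""
    ok_bodies = [body for status, body in responses if status == 200]
    if not ok_bodies:
        return 0.0
    return sum(is_usable(body) for body in ok_bodies) / len(ok_bodies)
```

Tracked next to the availability number, a rate like this drops when refusals or chatty output increase, even while every request still returns 200 OK.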

Agent Branch Coverage: Your Eval Hits the Happy Path, Not the Planner's If-Else

· 8 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a 240-case eval suite against their support agent. Green across the board for six months. Then they swapped a single sentence in the planner prompt (a tone tweak), and the next day production saw a 3× spike in human-handoff requests. The eval hadn't moved. The handoff branch had simply started firing on borderline cases that used to resolve in-line, and not a single eval case was borderline in that way. The branch existed in the prompt. It existed in production. It did not exist in the eval.

This is the failure mode I want to name: agent branch coverage. Code-coverage tooling has been a debugging staple for forty years, but agentic systems have a runtime control flow — planner branches that pick a tool, condition the response, escalate to a human, refuse to act, retry with a different strategy — and the eval suite touches only the cases the team thought to write. Eighty percent of the planner's decision branches have never executed under test, and a green eval becomes a smoke test wearing a regression-test costume.
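
A minimal sketch of the measurement, assuming each eval run already emits a trace of the planner branches it took. The branch names and the trace format (a list of branch labels per eval case) are hypothetical.

```python
def branch_coverage(declared_branches, eval_traces):
    """Return (coverage ratio, sorted list of branches never hit under eval)."""
    declared = set(declared_branches)
    hit = set()
    for trace in eval_traces:
        # Ignore labels that are not declared planner branches.
        hit.update(branch for branch in trace if branch in declared)
    missed = sorted(declared - hit)
    ratio = len(hit) / len(declared) if declared else 1.0
    return ratio, missed
```

The list of missed branches is the more useful output: each entry is a decision path that exists in the prompt and in production but has never executed under test.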

Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't

· 9 min read
Tian Pan
Software Engineer

The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever (pick the smartest scoring approach, get the best recall), but it is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling that teams build around prompts and evals catches it, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.

LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

· 11 min read
Tian Pan
Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.
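
Treating the mix as an SLO could look like the sketch below. It assumes each served request is logged with the tier that actually answered it. The tier names and the share bounds in the usage note are made-up values for illustration, not recommendations.

```python
from collections import Counter


def tier_mix(served_tiers):
    """Share of traffic served by each tier, from one tier label per request."""
    counts = Counter(served_tiers)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()} if total else {}


def mix_violations(mix, bounds):
    """Tiers whose share falls outside their (min_share, max_share) bounds."""
    violations = {}
    for tier, (low, high) in bounds.items():
        share = mix.get(tier, 0.0)
        if share < low or share > high:
            violations[tier] = share
    return violations
```

With hypothetical bounds such as `{"primary": (0.90, 1.0), "secondary": (0.0, 0.08), "canned": (0.0, 0.02)}`, traffic split 53% primary, 38% secondary and 9% canned would flag all three tiers, even though no single request failed.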

Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

· 11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.
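
A rough sketch of searching the cross-product instead of one axis at a time. It starts from a known-good configuration and applies subsets of the changed axes, smallest subsets first. The axis names, the `is_regressed` callback (standing in for "run the eval suite against this configuration"), and the assumption that a fully known-good configuration can be reconstructed are all hypothetical.

```python
from itertools import combinations


def find_interaction(good, bad, is_regressed, max_order=2):
    """Smallest set of axes that reproduces the regression when moved
    from their known-good value to their current (bad) value."""
    axes = sorted(good)
    for order in range(1, max_order + 1):
        for subset in combinations(axes, order):
            config = dict(good)  # start from the known-good baseline
            for axis in subset:
                config[axis] = bad[axis]
            if is_regressed(config):
                return subset
    return None  # no interaction found up to max_order
```

With five axes this is at most 5 single-axis runs plus 10 pairwise runs, so a two-factor interaction costs 15 eval runs to locate under these assumptions, and the result names the pair of owners who need to be in the same room.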