Skip to main content

131 posts tagged with "evaluation"

View all tags

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

· 11 min read
Tian Pan
Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

The LLM-as-Validator Antipattern: Why Your AI Quality Gate Has a Blind Spot

· 8 min read
Tian Pan
Software Engineer

Your AI feature ships with a quality gate: every response runs through a GPT-4 prompt that scores it on helpfulness, accuracy, and tone. Green scores trigger no alerts. The dashboard shows 97% pass rate. Meanwhile, your support tickets double.

The problem is structural. You used the same class of system that generates your outputs to validate those outputs. When the generator hallucinates a plausible-sounding fact, the judge — trained on the same distribution of internet text — reads the hallucination as credible and passes it through. Both models share the blind spot. Your quality gate is measuring confidence, not correctness.

Abstention as a Routing Decision: Why 'I Don't Know' Belongs in the Router, Not the Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams handle abstention with a single sentence in the system prompt: "If you are not confident, say you don't know." The model occasionally honors it, frequently doesn't, and the failure mode is asymmetric. A confidently-wrong answer ships at full velocity — it lands in the user's hands, gets quoted in a Slack thread, gets cited in a downstream summary. An honest abstention triggers a customer-success escalation because the user expected the agent to handle the request and now somebody has to explain why it didn't. Six months in, the team has learned which kind of failure costs less to ship, and the system prompt edit that nominally controls abstention has been quietly tuned for compliance, not for honesty.

The discipline that fixes this isn't a better wording. It's recognizing that abstention is a routing decision, not a prompt pattern. It deserves a first-class output channel, its own SLO, its own evaluation harness, and its own place in the system topology — somewhere outside the prompt, where it can be tested, owned, and scaled.

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.

Cost-Per-Correctness, Not Cost-Per-Token: The Unit Metric Your Bill Won't Tell You

· 11 min read
Tian Pan
Software Engineer

A team I know cut their inference bill 40% last quarter by migrating their support-email triage flow from a frontier model to a mid-tier one. The CFO sent a thank-you note. Six months later, customer support headcount was up two FTEs and average resolution time had risen 35%. Nobody connected the dots, because the dots lived in different dashboards: the inference bill on the platform team's, the support load on the operations team's. The migration looked like a win on the only metric anyone was tracking. The metric was wrong.

This is the cost-per-token trap. Your invoice tells you what you spent on tokens. It cannot tell you what you spent per correct task, because the inference vendor has no idea what "correct" means in your domain. They sold you raw compute. You bought outcomes — or thought you did. The gap between those two units is where AI unit economics quietly comes apart, and the team that doesn't measure the right denominator is running half the equation and shipping the other half blind.

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months chasing a regression that wasn't there. Every release passed the eval. Every release shipped. Every quarter, NPS on the AI-served cohort drifted down a point. Eventually, an intern doing a routine audit of the gold dataset noticed that one labeler — long since rotated off the contract — had graded 11% of the items, and that those items were systematically more lenient on a specific failure mode the team had been racing to fix. The eval said the model was getting better. The model was not getting better. The eval had been quietly tilted by one human's calibration drift, and nobody had been watching the labelers because nobody believed the labelers were a threat surface.

This is the eval-set poison pill. Most teams treat their eval set as a trusted artifact: the labels were graded by humans, the data came from production, and the regression dashboard is the one thing the org agrees to defer to when shipping. But the labeling pipeline is a human supply chain, and human supply chains are gameable. Treating an eval as ground truth without applying supply-chain hygiene to its inputs is trusting a number whose provenance you cannot defend.

Your Gold Eval Set Has Drifted and Its Pass Rate Is the Reason You Can't See It

· 12 min read
Tian Pan
Software Engineer

The gold eval set passes at 94%. The model has been bumped twice this quarter, the prompt has been edited eleven times, the tool catalog has grown by four, and the dashboard is still green. Then a sales engineer forwards a transcript where the agent confidently routes a customer to a workflow that was sunset two months ago, and the head of support quietly opens a thread asking why the satisfaction scores have been sliding for six weeks while the eval pipeline reports no regressions. The gold set isn't lying. It's measuring last quarter's product against this quarter's traffic, and nobody asked it to do anything else.

This is the failure mode evaluation systems make hardest to see, because the instrument that's supposed to detect quality regressions is itself the source of the false positive. Pass rate is computed against the items in the set; the items in the set were curated against a snapshot of usage; usage moved on; the rate stayed clean. The team trusts the green dashboard, ships another model upgrade, and discovers months later that the production distribution has been measuring something different than the eval set has been measuring for longer than anyone wants to admit.

The fix is not to refresh the gold set more often. Refresh cadence is the wrong knob; the right knob is having a second instrument calibrated to a different time window so disagreement between the two surfaces drift before users do. That second instrument is the shadow eval — a parallel set rebuilt continuously from current production traffic, run alongside the gold set, with the explicit job of disagreeing with it.

The LLM-Judge Ceiling: Why Your Auto-Eval Stops Correlating With Users at the Score That Matters

· 10 min read
Tian Pan
Software Engineer

LLM-as-judge is the productivity unlock that let evaluation coverage scale 10x without growing the human grading team. The problem is that the unlock is not uniform across the score range. The judge's agreement with humans is highest in the muddy middle of the distribution — the answers nobody is going to escalate either way — and collapses on the long tail of high-stakes outputs that actually decide whether a feature ships, gets rolled back, or paged at 2am. The dashboard graph stays green through the score range that nobody is ever happy with.

That is the LLM-judge ceiling: a measurement instrument with a non-uniform error profile that the team is reading as a single number. Aggregate agreement of 80% with humans is the headline most vendors put on the page; it is also the number that gets the team to trust the judge most where the judge is least informative.

The Model-Preference Fork: Why Your Prompt Library Has Three Versions and No One Is Tracking the Drift

· 11 min read
Tian Pan
Software Engineer

Open the prompt library of any team that has been shipping LLM features for more than a year and you will find the same thing: three slightly different versions of every prompt. One was tuned by the engineer who likes Sonnet for its instruction-following. One was rewritten by the engineer who switched to Haiku for the latency budget. One belongs to the prototype that only ever worked on Opus and never got migrated. Each version has a slightly different system message, a different way of describing the tool catalog, a different formatting nudge — and nobody is tracking how they drift.

This is not a hygiene problem. It is a coordination tax that compounds at every model upgrade, and it is silently breaking the relationship between your eval suite and your production traffic. The library is supposed to be a shared resource. In practice, every feature ships with whichever variant the author last tested, the eval suite runs against the variant the eval-author preferred, and the routing layer chooses among them based on cost rather than on which variant was actually validated against the live eval.

The team that doesn't notice is the team that's already paying.

Multilingual Eval Cost Amplification: Why Seven Locales Doesn't Cost 7×

· 14 min read
Tian Pan
Software Engineer

The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.

Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Portuguese-Brazilian pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content — the team initially read this as a German model regression until a manual audit revealed the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?

Prompt Linting Is the Missing Layer Between Eval and Production

· 11 min read
Tian Pan
Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.