
143 posts tagged with "evals"


Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

· 12 min read
Tian Pan
Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English and 6% Spanish, and the remainder is a long tail of locales, none of which carries enough weight to move the rollup. The French regression is in your data; it is just sitting three decimal places below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
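A minimal sketch of what stratifying by locale can look like, assuming each eval result is a `(locale, score)` pair with scores in the range 0 to 1. The function names and the `threshold` default are illustrative assumptions, not part of any particular eval framework.

```python
from collections import defaultdict


def scores_by_locale(results):
    """Group (locale, score) pairs and return {locale: (mean_score, n_cases)}."""
    totals = defaultdict(lambda: [0.0, 0])
    for locale, score in results:
        totals[locale][0] += score
        totals[locale][1] += 1
    return {loc: (total / n, n) for loc, (total, n) in totals.items()}


def locale_regressions(baseline, candidate, threshold=0.02):
    """Return {locale: delta} for locales whose mean dropped by more than threshold.

    `threshold` is a hypothetical default; pick one that matches the noise
    you observe per locale, which is usually larger for small slices.
    """
    before = scores_by_locale(baseline)
    after = scores_by_locale(candidate)
    return {
        loc: after[loc][0] - before[loc][0]
        for loc in before
        if loc in after and before[loc][0] - after[loc][0] > threshold
    }


# Made-up data: English improves slightly, French drops sharply.
baseline = [("en", 0.80)] * 88 + [("fr", 0.80)] * 2
candidate = [("en", 0.82)] * 88 + [("fr", 0.60)] * 2
regressions = locale_regressions(baseline, candidate)
```

In this made-up data the pooled mean rises while the French slice falls, which is the disagreement described above. Reporting the per-locale case count alongside the mean helps separate a real drop from small-sample noise.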

Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices

· 10 min read
Tian Pan
Software Engineer

Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what are the evals actually scoring against?

This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.

On-Call at 3am for an AI Feature That Didn't 500

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 3:02 AM. You squint at your phone expecting the usual: a database failover, a CDN edge that wandered off, a 500 spike from a service nobody touched in eight months. Instead the alert reads: summarizer.eval-on-traffic.helpfulness rolling-1h: 4.21 → 4.05 (Δ -0.16). No HTTP error. No latency spike. No service is down. Every request the system served in the last hour returned a 200 with a body that parsed cleanly. And yet something is unmistakably worse than it was at midnight, and the rotation expects you to figure out what.

This is the on-call shift the standard runbook wasn't written for. The thing that broke didn't break — it regressed. The error budget you've been tracking for years is denominated in availability and latency, and the failure mode that paged you isn't visible in either. The page is real, the customer impact is real, and your usual diagnostic loop — check the deploy log, check the dependency graph, find the bad release, roll it back — runs into a wall the moment you realize that "the bad release" might be a 30-line system-prompt diff that landed at 4 PM yesterday and looked completely innocuous in code review.

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt — "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause turns out to be that the planner now asks more clarifying questions, the executor receives shorter task descriptions on the second turn, the verifier's rubric was implicitly tuned against the previous executor's longer outputs, and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.
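One way to make those edges explicit is to record which prompts consume each prompt's output and walk that graph whenever a prompt changes. The sketch below assumes a simple planner, executor, verifier chain like the one in the example above; the `PROMPT_GRAPH` contents and the function name are hypothetical.

```python
from collections import deque

# Hypothetical edges: each prompt maps to the prompts that consume its output.
PROMPT_GRAPH = {
    "planner": ["executor"],
    "executor": ["verifier"],
    "verifier": [],
}


def downstream_prompts(graph, changed):
    """Breadth-first walk returning every prompt reachable from `changed`."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# An edit to the planner should trigger the executor and verifier evals too.
affected = downstream_prompts(PROMPT_GRAPH, "planner")
```

A CI check could use a list like `affected` to decide which eval sets to rerun for a given prompt diff, so an edit to the planner is graded against more than the planner's own eval set.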

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

· 11 min read
Tian Pan
Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.

Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

· 10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then they file the rater capacity as a footnote — "we'll get the annotation team to grade a few hundred per week" — and ship the rest. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the team that treats annotation as an SRE problem rather than a recruiting one is the team that ships. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week. Those limits are not a recruiting problem to be brute-forced with headcount; they are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.
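As a rough illustration of the capacity arithmetic, here is a sketch that estimates how long a backlog takes to drain. It uses the 4,300-item queue and the 500 to 1,000 examples-per-week range quoted above; the rater count and the weekly inflow are assumptions made up for the example.

```python
import math


def weeks_to_clear(backlog, raters, per_rater_per_week, new_items_per_week=0):
    """Weeks needed to drain the queue, or None if inflow meets or exceeds capacity."""
    net_capacity = raters * per_rater_per_week - new_items_per_week
    if net_capacity <= 0:
        return None  # the queue never shrinks at this staffing level
    return math.ceil(backlog / net_capacity)


# Two hypothetical raters at the low end of the range, no new items arriving.
best_case = weeks_to_clear(4300, raters=2, per_rater_per_week=500)

# One rater whose weekly output is fully consumed by assumed new inflow.
stalled = weeks_to_clear(4300, raters=1, per_rater_per_week=500, new_items_per_week=500)
```

The second call returns `None` because inflow equals capacity, which is the case where the queue stays at its current depth no matter how long the team waits.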

The Second-Draft Agent Pattern: Why Explore-Then-Commit Beats Self-Critique

· 12 min read
Tian Pan
Software Engineer

When a single-pass agent stops being good enough, the default move is to wrap it in a self-critique loop. Generate, critique, revise, repeat. Most teams I talk to assume the eval lift will be roughly linear with the number of revision rounds and stop there. The numbers rarely cooperate. By the third round of self-critique, accuracy is up two or three points and token cost is up 3–4x, and the failure modes that didn't get caught in round one mostly don't get caught in round three either — because the same context that produced the wrong answer is the one being asked to spot the wrongness.

A different shape works better and costs less: let the first pass be wasteful exploration, throw it away, and run a second pass from a clean context with just the lessons learned. Call it the second-draft pattern, or explore-then-commit. The first draft is permitted to be sloppy, to take dead ends, to dump scratch artifacts, to chase hypotheses that turn out to be wrong. The second draft is constrained — it gets the distilled findings and produces a clean execution. On the kinds of tasks where self-critique is tempting (multi-step reasoning, code that touches several files, research syntheses), this two-pass shape often beats n-of-k self-critique on both quality and cost.

Shadow Evals: When Private Slices Replace Your Eval Rollup

· 10 min read
Tian Pan
Software Engineer

The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" — and watch them answer yes, all three of them, with three different numbers, against three different slices, on three different laptops, none of which is reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook says you don't have evals. The reality is worse: you have too many evals, each of them privately owned, each of them measuring something real, and none of them rolling up into a single number the org can plan against.

This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.

The Support Ticket to Eval Case Pipeline Nobody Builds

· 10 min read
Tian Pan
Software Engineer

Every team running an AI feature in production is sitting on the highest-signal eval dataset they will ever have, and they are not using it. The dataset is in Zendesk. Or Intercom. Or Freshdesk, or Help Scout, or whatever queue the support team lives inside. The tickets that get filed there describe the exact failure modes the model produced in front of a paying customer — wrong tone, wrong tool call, wrong policy, hallucinated capability, leaked context. Each one is a labeled negative example, hand-written by the user who experienced the failure, often with reproduction steps and a sentiment annotation attached for free.

The eval suite, meanwhile, lives in Git. It was hand-written by whichever engineer set it up six months ago, and it has accumulated maybe fifty cases since. The intersection between "things the eval suite covers" and "things that actually break in production" is a Venn diagram with a thin sliver of overlap and two large, mutually ignorant lobes.

Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

· 9 min read
Tian Pan
Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong — it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.
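A frequency check of this kind can sit alongside the correctness evals. The sketch below assumes each agent run is logged as a list of tool names; the function names and the 10% `tolerance` are illustrative assumptions rather than a recommended setting.

```python
from collections import Counter


def call_rates(runs):
    """runs: list of per-run tool-name lists. Returns mean calls per run by tool."""
    if not runs:
        return {}
    counts = Counter(tool for run in runs for tool in run)
    return {tool: n / len(runs) for tool, n in counts.items()}


def frequency_shifts(baseline_runs, candidate_runs, tolerance=0.10):
    """Return {tool: relative_change} where call frequency moved beyond tolerance."""
    before = call_rates(baseline_runs)
    after = call_rates(candidate_runs)
    shifts = {}
    for tool in before.keys() | after.keys():
        b = before.get(tool, 0.0)
        a = after.get(tool, 0.0)
        if b == 0.0:
            if a > 0.0:
                shifts[tool] = float("inf")  # tool was never called before
            continue
        change = (a - b) / b
        if abs(change) > tolerance:
            shifts[tool] = change
    return shifts


# Made-up logs: 18 of 100 candidate runs call "search" a second time.
baseline_runs = [["search"]] * 100
candidate_runs = [["search"]] * 82 + [["search", "search"]] * 18
shifts = frequency_shifts(baseline_runs, candidate_runs)
```

With these made-up logs the `search` tool shows a relative increase of about 0.18 even though every individual call could still pass a correctness check.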

Your PRD Is an Untested Prompt — Until You Eval It

· 9 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that shipped in the last six months and read it side by side with the PRD that authorized it. You will find two documents arguing with each other. The PRD says "the assistant should be helpful but professional, avoid making things up, and gracefully decline if it can't answer." The system prompt says "You are an AI assistant. Be concise. If you are unsure, say 'I don't know.' Never invent facts." The PRD takes a page. The prompt takes nine lines. The gap between them is where every behavioral bug you shipped this quarter lives.

The convenient fiction is that the prompt is an "implementation detail" of the PRD. The actual relationship is the opposite. The prompt is the contract the model executes; the PRD is a draft of that contract written in a language the model does not speak, by an author who never compiled it. Every PRD for an AI feature is an untested prompt. The team that admits this and runs the PRD through an eval before sign-off ships a feature with one fewer source of post-launch surprise.