113 posts tagged with "evals"


The Second-Draft Agent Pattern: Why Explore-Then-Commit Beats Self-Critique

· 12 min read
Tian Pan
Software Engineer

When a single-pass agent stops being good enough, the default move is to wrap it in a self-critique loop. Generate, critique, revise, repeat. Most teams I talk to assume the eval lift will be roughly linear with the number of revision rounds and stop there. The numbers rarely cooperate. By the third round of self-critique, accuracy is up two or three points and token cost is up 3–4x, and the failure modes that didn't get caught in round one mostly don't get caught in round three either — because the same context that produced the wrong answer is the one being asked to spot the wrongness.

A different shape works better and costs less: let the first pass be wasteful exploration, throw it away, and run a second pass from a clean context with just the lessons learned. Call it the second-draft pattern, or explore-then-commit. The first draft is permitted to be sloppy, to take dead ends, to dump scratch artifacts, to chase hypotheses that turn out to be wrong. The second draft is constrained — it gets the distilled findings and produces a clean execution. On the kinds of tasks where self-critique is tempting (multi-step reasoning, code that touches several files, research syntheses), this two-pass shape often beats multi-round self-critique on both quality and cost.
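To make the shape concrete, here is a minimal sketch of the two passes, assuming a hypothetical single-turn `call_model` helper in place of a real LLM client: an unconstrained exploration pass, a distillation step, and a fresh-context commit pass that sees only the task plus the distilled lessons.

```python
# Minimal sketch of the explore-then-commit shape. `call_model` is a
# hypothetical single-turn LLM helper; wire in your own client.

def call_model(system: str, user: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def second_draft(task: str) -> str:
    # Pass 1: exploration. Deliberately unconstrained; dead ends, scratch
    # work, and wrong hypotheses are all acceptable here.
    exploration = call_model(
        system="Explore the task. Try approaches, note what fails and why. "
               "Do not write a final answer.",
        user=task,
    )

    # Distill: pull only the durable lessons out of the messy transcript.
    # This is the single artifact that survives into the second pass.
    lessons = call_model(
        system="Summarize the findings below as a short list of facts, "
               "constraints, and approaches to avoid. No narrative.",
        user=exploration,
    )

    # Pass 2: commit. Fresh context, no exploration transcript; only the
    # task plus the distilled lessons, so the clean draft is not anchored
    # to the first draft's mistakes.
    return call_model(
        system="Produce the final answer. Follow the constraints given.",
        user=f"Task:\n{task}\n\nLessons from a prior exploration:\n{lessons}",
    )
```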

Shadow Evals: When Private Slices Replace Your Eval Rollup

· 10 min read
Tian Pan
Software Engineer

The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" — and watch them answer yes, all three of them, with three different numbers, against three different slices, on three different laptops, none of which is reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook problem is not having evals at all. The reality is worse: you have too many evals, each of them privately owned, each of them measuring something real, and none of them rolling up into a single number the org can plan against.

This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.
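For illustration only, the rollup itself can be very small once the private slices are registered in one place: a traffic-weighted average over slice pass rates. The slice names, weights, and `SliceResult` schema below are invented, not taken from any team's data.

```python
# Sketch of a single-number rollup over registered eval slices.
# Illustrative only; field names and weights are assumptions.

from dataclasses import dataclass

@dataclass
class SliceResult:
    name: str             # which eval slice (e.g. "long_tail", "billing")
    pass_rate: float      # fraction of cases passing on this slice
    traffic_share: float  # fraction of production traffic the slice represents

def rollup(results: list[SliceResult]) -> float:
    """Traffic-weighted pass rate: one number the org can plan against."""
    total_weight = sum(r.traffic_share for r in results)
    return sum(r.pass_rate * r.traffic_share for r in results) / total_weight

if __name__ == "__main__":
    results = [
        SliceResult("long_tail", pass_rate=0.78, traffic_share=0.10),
        SliceResult("billing",   pass_rate=0.92, traffic_share=0.35),
        SliceResult("general",   pass_rate=0.88, traffic_share=0.55),
    ]
    print(f"rollup pass rate: {rollup(results):.3f}")
```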

The Support Ticket to Eval Case Pipeline Nobody Builds

· 10 min read
Tian Pan
Software Engineer

Every team running an AI feature in production is sitting on the highest-signal eval dataset they will ever have, and they are not using it. The dataset is in Zendesk. Or Intercom. Or Freshdesk, or Help Scout, or whatever queue the support team lives inside. The tickets that get filed there describe the exact failure modes the model produced in front of a paying customer — wrong tone, wrong tool call, wrong policy, hallucinated capability, leaked context. Each one is a labeled negative example, hand-written by the user who experienced the failure, often with reproduction steps and a sentiment annotation attached for free.

The eval suite, meanwhile, lives in Git. It was hand-written by whichever engineer set it up six months ago, and it has accumulated maybe fifty cases since. The intersection between "things the eval suite covers" and "things that actually break in production" is a Venn diagram with a thin sliver of overlap and two large, mutually ignorant lobes.
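A first version of the pipeline that closes that gap can be very small. The sketch below assumes a generic ticket export with hypothetical field names (`tags`, `reproduction_steps`, `observed_response`), not any particular Zendesk or Intercom schema, and emits one labeled negative eval case per AI-failure ticket.

```python
# Sketch of turning an exported support ticket into an eval case.
# Ticket fields and the eval-case schema are illustrative, not a real API.

import json

def ticket_to_eval_case(ticket: dict) -> dict | None:
    """Convert one support ticket into a labeled negative eval case."""
    # Only tickets tagged as AI-feature failures carry eval signal.
    if "ai_failure" not in ticket.get("tags", []):
        return None
    return {
        "id": f"ticket-{ticket['id']}",
        "input": ticket["reproduction_steps"],      # what the user actually did
        "bad_output": ticket["observed_response"],  # the failure, verbatim
        "failure_mode": ticket.get("category", "uncategorized"),
        "source": "support_queue",                  # keeps provenance visible
    }

if __name__ == "__main__":
    with open("tickets_export.json") as f:
        tickets = json.load(f)
    cases = [c for t in tickets if (c := ticket_to_eval_case(t))]
    with open("eval_cases_from_tickets.jsonl", "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")
```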

Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

· 9 min read
Tian Pan
Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong — it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.
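One low-effort starting point is to stop recording pass rates in a vacuum: stamp every eval run with the latency and timeout conditions it actually saw, so a 2 AM run is never compared against a 10 AM ET run as if they came from the same world. The sketch below assumes a `call_model` callable that accepts a timeout; the field names and thresholds are illustrative.

```python
# Sketch: record the provider conditions an eval run actually saw alongside
# the outputs. Field names and the default timeout are assumptions.

import time
from datetime import datetime, timezone

def run_case_with_conditions(case, call_model, timeout_s: float = 30.0) -> dict:
    start = time.monotonic()
    try:
        output = call_model(case["input"], timeout=timeout_s)
        timed_out = False
    except TimeoutError:
        output, timed_out = None, True
    return {
        "case_id": case["id"],
        "output": output,
        "latency_s": time.monotonic() - start,
        "timed_out": timed_out,
        "wall_clock_utc": datetime.now(timezone.utc).isoformat(),
    }

def summarize(results: list[dict]) -> dict:
    latencies = sorted(r["latency_s"] for r in results)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "p95_latency_s": p95,
        "timeout_rate": sum(r["timed_out"] for r in results) / len(results),
        # Compare quality only between runs whose recorded conditions overlap.
        "ran_at": results[0]["wall_clock_utc"],
    }
```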

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.
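A cheap guard for this class of regression is to grade tool frequency alongside tool correctness: store a per-tool call rate from a baseline eval run and flag any tool whose rate moves more than some tolerance. The sketch below is illustrative; the 10% tolerance and the example numbers are invented.

```python
# Sketch of a tool-frequency regression check that sits next to the
# existing correctness grading. Tolerance and example rates are illustrative.

from collections import Counter

def tool_call_rates(transcripts: list[list[str]]) -> dict[str, float]:
    """Per-tool calls per transcript, e.g. {'search': 1.4, 'lookup': 0.3}."""
    counts = Counter(tool for t in transcripts for tool in t)
    return {tool: n / len(transcripts) for tool, n in counts.items()}

def frequency_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.10,
) -> list[str]:
    """Tools whose call rate moved more than `tolerance` vs. the baseline."""
    flagged = []
    for tool, base_rate in baseline.items():
        new_rate = candidate.get(tool, 0.0)
        if base_rate > 0 and abs(new_rate - base_rate) / base_rate > tolerance:
            flagged.append(f"{tool}: {base_rate:.2f} -> {new_rate:.2f} calls per run")
    return flagged

if __name__ == "__main__":
    baseline = {"search": 1.20, "calendar": 0.40}
    candidate = {"search": 1.42, "calendar": 0.41}  # ~18% more search calls
    for line in frequency_regressions(baseline, candidate):
        print("FREQUENCY SHIFT:", line)
```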

Your PRD Is an Untested Prompt — Until You Eval It

· 9 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that shipped in the last six months and read it side by side with the PRD that authorized it. You will find two documents arguing with each other. The PRD says "the assistant should be helpful but professional, avoid making things up, and gracefully decline if it can't answer." The system prompt says "You are an AI assistant. Be concise. If you are unsure, say 'I don't know.' Never invent facts." The PRD takes a page. The prompt takes nine lines. The gap between them is where every behavioral bug you shipped this quarter lives.

The convenient fiction is that the prompt is an "implementation detail" of the PRD. The actual relationship is the opposite. The prompt is the contract the model executes; the PRD is a draft of that contract written in a language the model does not speak, by an author who never compiled it. Every PRD for an AI feature is an untested prompt. The team that admits this and runs the PRD through an eval before sign-off ships a feature with one fewer source of post-launch surprise.
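One way to run the PRD through an eval is to rewrite its behavioral sentences as checkable criteria and grade the current prompt against each one before sign-off. The sketch below assumes a hypothetical `judge` helper (an LLM-as-judge call returning pass/fail for a single criterion) and invented criteria; it is a shape, not a finished harness.

```python
# Sketch: each behavioral sentence in the PRD becomes a graded assertion the
# prompt must pass before sign-off. `judge` and the criteria are assumptions.

def judge(criterion: str, user_input: str, model_output: str) -> bool:
    raise NotImplementedError("wire up your judge model here")

# PRD sentences rewritten as checkable criteria (illustrative).
PRD_CRITERIA = [
    "The response declines gracefully when the answer is not known.",
    "The response does not state facts absent from the provided context.",
    "The tone is professional and free of filler apologies.",
]

def eval_prd_against_prompt(cases: list[dict], run_prompt) -> dict[str, float]:
    """Pass rate per PRD criterion for the current system prompt."""
    scores = {c: 0 for c in PRD_CRITERIA}
    for case in cases:
        output = run_prompt(case["input"])
        for criterion in PRD_CRITERIA:
            scores[criterion] += judge(criterion, case["input"], output)
    return {c: s / len(cases) for c, s in scores.items()}
```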

Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship

· 10 min read
Tian Pan
Software Engineer

The eval set that scored 92% last quarter is now scoring 94%, and the team is calling that progress. It isn't. The labels in that eval set were written against a rubric the annotators no longer hold in their heads. The product the model is being graded on has moved. The standards have moved. The annotators' own calibration has moved. What looks like a two-point improvement is the silent gap between a frozen artifact and a living product, and that gap widens every week the team doesn't refresh.

Annotation drift is the quiet failure mode of mature LLM eval programs. It doesn't show up as a regression — regressions are the easy case, because the number goes down and somebody investigates. It shows up as a number that stays green while the thing it's supposed to measure decays underneath it. Teams that have already built an eval set, written a rubric, and recruited annotators are the most exposed, because they trust the system they built and stop auditing the foundation.
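The audit that catches this does not need to be elaborate. The sketch below re-labels a random sample of the frozen eval set against the current rubric (via any `relabel` callable, human or judge) and reports agreement with the stored labels; the case schema and the 0.9 threshold are assumptions for illustration.

```python
# Sketch of a recurring drift audit: re-label a sample of the frozen eval set
# under today's rubric and compare to the stored labels. A falling agreement
# rate means the set has drifted even if the headline pass rate stays green.

import random

def drift_audit(eval_set: list[dict], relabel, sample_size: int = 50,
                threshold: float = 0.90, seed: int = 0) -> dict:
    rng = random.Random(seed)
    sample = rng.sample(eval_set, min(sample_size, len(eval_set)))
    agreements = [
        relabel(case["input"], case["output"]) == case["label"]
        for case in sample
    ]
    agreement_rate = sum(agreements) / len(agreements)
    return {
        "agreement_with_frozen_labels": agreement_rate,
        "needs_refresh": agreement_rate < threshold,
        "sampled": len(sample),
    }
```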

Asymmetric Eval Economics: Why One Eval Case Costs More Than the Feature It Tests

· 9 min read
Tian Pan
Software Engineer

Here is the awkward truth most AI teams discover six months too late: a single well-designed eval case routinely costs more engineering effort than the feature it is supposed to test. A prompt edit takes an afternoon. The eval case that gives you confidence the prompt edit didn't break something takes a domain expert two days of labeling, a calibration loop with a judge prompt, and a discussion about what "correct" even means for this user surface. The feature ships in a sprint. The eval that lets you ship the next ten features safely takes a quarter to mature.

The asymmetry isn't a bug. It is the structural shape of evaluation work. Labeling, edge-case curation, judge calibration, and rubric design are upfront fixed costs that don't scale with how many features you ship — they scale with how many distinct behaviors you want to verify. Meanwhile the feature side keeps producing what feels like cheap marginal output: "another prompt iteration," "one more tool added to the agent," "swap the model." Each looks individually small. Each silently increases the surface area the eval set must cover.

The Demo-to-Dogfood Gap: Why Your AI Feature Dies Between the Launch Slide and Monday Morning

· 11 min read
Tian Pan
Software Engineer

The demo went perfectly. The room clapped. Two weeks later, the same feature lands in the company Slack for internal use, and by Wednesday a senior engineer is posting screenshots with the caption "did anyone test this?" By Friday the channel has gone quiet — not because the bugs were fixed, but because the people who would have flagged them gave up and went back to their old workflow. The launch is still on the calendar. Nobody has cancelled it. Nobody has the political capital to.

This is the demo-to-dogfood gap, and the MIT NANDA initiative measured it last year at 95% — that is the share of enterprise generative AI pilots that produced no measurable P&L impact, and almost all of them had a demo somebody loved. The model was not the problem. The gap between the demo and the first week of internal use was the problem, and every team that has shipped an AI feature has watched some version of it play out.

The Eval Backfill Tax: Why Every Model Capability Launch Costs More Than You Budgeted

· 9 min read
Tian Pan
Software Engineer

An executive sends a one-line email: "great news — we're adding vision next sprint." The product manager interprets it as a one-week project: swap the model, expose an image parameter, ship. The eval team reads the same email and starts mentally drafting a four-week schedule that nobody has approved yet. By Friday, the disconnect surfaces in standup as a vague "we'll need to do some eval work" and everyone agrees to figure it out later.

That gap between "we added vision" and "we can safely ship vision" is the eval backfill tax. It is the work that quietly falls on the eval team every time a new model capability lands — multimodal input, tool use, longer context, reasoning traces, computer use — because the historical test cases were constructed in a regime where the model could not fail in the new ways the new capability introduces. The suite stays green, the headline benchmark goes up, and the production launch surfaces failure modes nobody wrote a test for.

The Eval Bus Factor: When the Person Who Defined 'Correct' Walks Out the Door

· 10 min read
Tian Pan
Software Engineer

A team I worked with recently lost their senior ML engineer. Two weeks later, the eval suite was still green on every PR — 847 cases, all passing, judge agreement at 92%. Six weeks later, a customer found a regression that should have been caught by the very first eval case in the support-quality bucket. When the team went to debug, nobody could explain why that case had been written, what failure mode it was supposed to catch, or why the judge prompt graded it on a 1–4 scale instead of binary. The case was still passing. It just wasn't testing anything anyone could name.

This is the eval bus factor: the silent failure mode where the person who decided what "correct" means for your AI feature was also the person who curated the test cases, calibrated the judge, and absorbed every implicit labeling tradeoff in their head. When they leave, the suite remains green but stops generating reliable promote/reject signal — because nobody else can extend it, debug a flaky judge, or evaluate whether a new failure mode belongs in the test set.

Eval Triage Queues: Why FIFO Misses the Failures That Matter

· 11 min read
Tian Pan
Software Engineer

A healthy eval set is supposed to be a sign of maturity. It is also, on any given Monday, a thousand failed cases sitting in a queue with a human reviewer who has eight hours and can get through about fifty of them. The arithmetic is brutal: roughly one in twenty failures gets read. The other nineteen wait. Which nineteen wait, and which one gets the seat, is decided by whichever order the file happens to load in.

Most teams call this "reviewing failures." It is closer to a lottery weighted by alphabetical order. A failure case that affects two percent of production traffic and lives at the top of the file gets attention. A failure case that affects forty percent of production traffic and lives near the bottom gets a glance on Friday afternoon, if at all. The team ships a fix for the small problem on Tuesday and writes a retro on Thursday wondering why the dashboard hasn't moved.
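The smallest fix is to stop reading the queue in file order and sort it by estimated production impact instead. The sketch below assumes each failed case carries a slice label and that a traffic share per slice is known; the numbers mirror the 2% and 40% example above and are otherwise invented.

```python
# Sketch of impact-weighted triage instead of FIFO: order failed cases by the
# share of production traffic their slice represents. Traffic shares are
# illustrative assumptions.

def triage_order(failures: list[dict], traffic_share: dict[str, float]) -> list[dict]:
    """Sort failures so the reviewer's ~50 daily reads hit the biggest slices."""
    return sorted(
        failures,
        key=lambda f: traffic_share.get(f["slice"], 0.0),
        reverse=True,
    )

if __name__ == "__main__":
    failures = [
        {"id": "case-0007", "slice": "niche_locale"},   # top of the file
        {"id": "case-0981", "slice": "checkout_flow"},  # bottom of the file
    ]
    traffic_share = {"niche_locale": 0.02, "checkout_flow": 0.40}
    for f in triage_order(failures, traffic_share)[:50]:
        print(f["id"], f["slice"])
```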