Skip to main content

143 posts tagged with "evals"

View all tags

Snapshot Tests Lie When Your Model Is Stochastic

· 11 min read
Tian Pan
Software Engineer

The first time a junior engineer on your team types --update-snapshots and pushes to main, your test suite stops being a test suite. It becomes a transcript. The diffs still render in green and red, the CI badge still flips to passing, but the signal has quietly inverted: instead of telling you whether the code is correct, the suite now tells you whether anyone bothered to look at the output. With deterministic code that ratio is acceptably low, because most diffs really are intentional. With a stochastic model on the other end of a network call, the same workflow turns every PR into a coin flip, and every reviewer into a rubber stamp.

Snapshot testing was a beautiful idea for a deterministic world. You record what render(<Button />) produced last Tuesday, you assert that this Tuesday it produces the same string, and any diff is, by definition, a behavior change worth a human eyeball. The pattern survived Jest, Vitest, Pytest, the whole React ecosystem, and a generation of UI snapshot extensions, because the underlying contract held: same input plus same code equals same output. The contract does not hold for an LLM call. Same input plus same code plus same prompt produces a different string, and the difference is not a bug — it is the product working as designed.

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams who institutionalize this hour develop a mental model of their AI feature their dashboards will never surface. Teams who skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.

Clarification Budgets: When Your Agent Should Ask Instead of Guess

· 10 min read
Tian Pan
Software Engineer

The two worst agent failure modes feel like opposites, but they originate from the same broken policy. The first agent asks four follow-up questions before doing anything and trains its users to abandon it. The second agent never asks, confidently produces output the user has to redo, and trains its users to mistrust it. Same policy, different settings of one missing parameter: the cost of a question relative to the cost of a wrong answer.

Most agents do not have a policy at all. The model is asked to "be helpful" and is left to negotiate ambiguity on its own. Because next-token prediction rewards committing to an answer, the agent leans toward guessing. Because RLHF rewards politeness, the agent occasionally over-corrects and asks a question for safety. The result is unprincipled behavior that varies from session to session, with no team-level intuition about when the agent will pause and when it will charge ahead.

A clarification budget is the missing parameter. It is a per-task allowance for how much friction the agent is permitted to impose, paired with a decision rule for when a question is worth spending that budget on. Think of it as the conversational analog of a latency budget — every product has one, even if no one wrote it down, and the team that writes it down stops shipping confused agents.

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.

Tool Call Ordering Is a Partial Order, Not a Set

· 10 min read
Tian Pan
Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.

The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor

· 11 min read
Tian Pan
Software Engineer

There is a quiet myth in AI engineering that goes like this: a "minor" model bump — claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date — should be a drop-in upgrade. The provider releases notes that talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.

Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.

Your Prompts Ship Like Cowboys: Why Code Review Discipline Doesn't Extend to AI Artifacts

· 11 min read
Tian Pan
Software Engineer

Walk through any mature engineering team's PR queue and you will see the same thing: a four-line bug fix attracts three rounds of comments about naming, error handling, and missing test coverage, while a forty-line edit to the system prompt sails through with a single "LGTM, ship it." The author shrugged because the diff looks like documentation. The reviewer shrugged because they have no mental model of what "good" looks like inside that block of English. The result is a prompt change with the blast radius of a feature launch, reviewed at the bar of a typo fix.

This is the quiet quality crisis of every team building with LLMs in production. The codebase has decades of accumulated discipline — linters, type checks, code owners, test gates, deploy windows. The artifacts that actually steer the model — the system prompt, the eval rubric, the tool description, the few-shot exemplars — sit in the same repo and ship through a review process that was designed for English prose. So prompt regressions, eval-rubric drift, and tool-schema breakages land at a quality bar the team would never accept for code.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

The Eval Bottleneck: Your Eval Engineer Is Now the Roadmap

· 11 min read
Tian Pan
Software Engineer

The constraint on your AI roadmap isn't GPU capacity, model availability, or prompt-engineering taste. It's the calendar of one or two engineers who actually know how to build an eval that catches a regression. Every PM with a feature is in their queue. Every model upgrade is in their queue. Every cohort drift, every prompt revision, every "is this judge still calibrated" question lands in the same inbox. And the engineer in question said "no, this isn't ready" three times this quarter, got overruled twice, watched the regression compound in production, and is now updating their LinkedIn.

This is the eval bottleneck, and most orgs don't see it until it bites. Through 2025 the visible scaling story was AI engineers — hire AI engineers, ship AI features, iterate on prompts, swap models. By Q1 2026 the throughput problem moved one layer down. The team that doubled its AI headcount discovered that adding more feature engineers didn't make features ship faster, because every feature still needed an eval, and the eval engineer was the same person.

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.

Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.

The LLM SDK Upgrade Tax: Why a Patch Bump Is a Model Rollout in Disguise

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a regression to production at 2:14 a.m. on a Tuesday. The on-call alert fired because the JSON parser downstream of their summarization agent was rejecting one in twenty responses with a trailing-comma error. The model hadn't changed. The prompt hadn't changed. The eval suite had passed at 96.4% the night before, comfortably above the 95% gate. What had changed was a single line in package.json: the model provider's SDK had moved from 4.6.2 to 4.6.3. Patch bump. Auto-merged by the dependency bot. The release notes said "internal cleanups."

The "internal cleanup" was a tightened JSON-mode parser that now stripped a forgiving fallback path, which had been quietly fixing a recurring trailing-comma quirk in the model's tool-call output. The model's behavior was unchanged. The SDK's interpretation of that behavior was not. The team's eval suite never saw the regression because the eval suite ran against a different SDK version than the one the dependency bot had just promoted.

This is the LLM SDK upgrade tax, and it is one of the quietest, most expensive failure modes in production AI today. The SDK is not a passive transport. It is an active participant in your prompt's behavior, and the team that upgrades it without an eval is doing a model rollout in disguise.