134 posts tagged with "evals"

Persona Overlays: When One Agent Needs Many Voices for Different Customer Cohorts

· 11 min read
Tian Pan
Software Engineer

A Fortune 500 procurement lead opens your support agent and asks why the SOC 2 report references a control your product no longer implements. Your agent answers in the same chipper voice it uses with hobbyists on the free tier — three exclamation points, an emoji, and a cheerful suggestion to "ping our team" with no escalation path or citation. The procurement lead forwards the screenshot to her CISO with one line: "This is who they sent to handle our compliance question." You lose the renewal not because the answer was wrong, but because the voice was wrong for the room.

Most teams ship one agent persona because the org chart has one support team. The customer base, however, is rarely that uniform. Enterprise buyers expect formality, citations, and named escalation paths. Self-serve users want quick answers and zero friction. Developers want code, not paragraphs. The single-persona agent reads as condescending to one cohort and unprofessional to another, and "let users pick a tone" punts a product decision to the user that the user shouldn't have to make.

The PRD for an AI Feature: Why Your Old Template Misses the Cliff

· 10 min read
Tian Pan
Software Engineer

The deterministic-software PRD template has aged into a kind of muscle memory. Problem statement, user stories, acceptance criteria, edge cases, success metrics, scope cuts. Engineers know how to read it. PMs know how to fill it in. Designers know which sections to lift wireframes from. It is a well-worn artifact that has shipped a generation of CRUD apps, dashboards, and SaaS workflows.

It also has no field for "what the model gets wrong five percent of the time." No field for "what we accept as a passing eval score." No field for "what the user sees when the model refuses to answer." No field for "which prompt version this PRD locks down, and who is allowed to change it after ship." Every AI feature shipped against that template is shipping with a hidden contract that nobody wrote down. Postmortems keep finding it the hard way.

Silent Quantization: Why the Model You Pay For Today Isn't the Model You Paid For Last Quarter

· 11 min read
Tian Pan
Software Engineer

The model name on your invoice is the same as it was last quarter. The version string in the API response hasn't changed. The model card and pricing page read identically. And yet your eval scores have drifted half a point downward, your refusal patterns shifted in ways your prompts didn't ask for, and a handful of customer complaints came in last Tuesday about output that "feels different." You debug your code. You don't find anything. The code didn't change. The weights did.

Silent quantization is the gap between the model you contracted for and the model the provider is actually serving. It happens because inference economics keep tightening — every dollar of GPU capacity has to feed more requests this quarter than last — and the cheapest way to absorb that pressure is to re-host the same model name on cheaper precision tiers. FP16 becomes FP8. FP8 becomes FP4 in some routes. Mixed-precision shards get swapped in. The version string doesn't move because the version string was never a precision contract; it was a marketing contract.

Snapshot Tests Lie When Your Model Is Stochastic

· 11 min read
Tian Pan
Software Engineer

The first time a junior engineer on your team types --update-snapshots and pushes to main, your test suite stops being a test suite. It becomes a transcript. The diffs still render in green and red, the CI badge still flips to passing, but the signal has quietly inverted: instead of telling you whether the code is correct, the suite now tells you whether anyone bothered to look at the output. With deterministic code that risk is acceptably low, because most diffs really are intentional. With a stochastic model on the other end of a network call, the same workflow turns every PR into a coin flip, and every reviewer into a rubber stamp.

Snapshot testing was a beautiful idea for a deterministic world. You record what render(<Button />) produced last Tuesday, you assert that this Tuesday it produces the same string, and any diff is, by definition, a behavior change worth a human eyeball. The pattern survived Jest, Vitest, Pytest, the whole React ecosystem, and a generation of UI snapshot extensions, because the underlying contract held: same input plus same code equals same output. The contract does not hold for an LLM call. Same input plus same code plus same prompt produces a different string, and the difference is not a bug — it is the product working as designed.
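To make the broken contract concrete, here is a minimal sketch. `fake_model` is a hypothetical, deterministic stand-in for a sampled LLM call (it cycles through canned phrasings instead of calling a real API), and the invariant check is only one assumed alternative to exact-match snapshots, not necessarily the approach this post goes on to recommend.

```python
# Hypothetical phrasings a model might produce for the same prompt.
TEMPLATES = [
    "Your refund will arrive within 5 business days.",
    "Expect the refund in about 5 business days.",
    "Refunds like yours usually land within 5 business days.",
]


def fake_model(prompt: str, sample_index: int) -> str:
    """Stand-in for a stochastic LLM call: the phrasing varies per sample."""
    return TEMPLATES[sample_index % len(TEMPLATES)]


def snapshot_matches(output: str, recorded: str) -> bool:
    """Classic snapshot assertion: exact string equality."""
    return output == recorded


def satisfies_invariants(output: str) -> bool:
    """Assumed alternative: check facts the answer must carry, not its wording."""
    text = output.lower()
    return "refund" in text and "5 business days" in text and len(output) <= 200


samples = [fake_model("When will I get my refund?", i) for i in range(6)]
recorded = samples[0]  # the "snapshot" captured on the first run

snapshot_pass_rate = sum(snapshot_matches(s, recorded) for s in samples) / len(samples)
invariant_pass_rate = sum(satisfies_invariants(s) for s in samples) / len(samples)

print(f"snapshot pass rate:  {snapshot_pass_rate:.2f}")
print(f"invariant pass rate: {invariant_pass_rate:.2f}")
```

Every sample here is an acceptable answer, yet most of them fail the exact-match comparison, which is the pressure that pushes reviewers toward blindly updating snapshots.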

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams who institutionalize this hour develop a mental model of their AI feature that their dashboards will never surface. Teams who skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.

Clarification Budgets: When Your Agent Should Ask Instead of Guess

· 10 min read
Tian Pan
Software Engineer

The two worst agent failure modes feel like opposites, but they originate from the same broken policy. The first agent asks four follow-up questions before doing anything and trains its users to abandon it. The second agent never asks, confidently produces output the user has to redo, and trains its users to mistrust it. Same policy, different settings of one missing parameter: the cost of a question relative to the cost of a wrong answer.

Most agents do not have a policy at all. The model is asked to "be helpful" and is left to negotiate ambiguity on its own. Because next-token prediction rewards committing to an answer, the agent leans toward guessing. Because RLHF rewards politeness, the agent occasionally over-corrects and asks a question for safety. The result is unprincipled behavior that varies from session to session, with no team-level intuition about when the agent will pause and when it will charge ahead.

A clarification budget is the missing parameter. It is a per-task allowance for how much friction the agent is permitted to impose, paired with a decision rule for when a question is worth spending that budget on. Think of it as the conversational analog of a latency budget — every product has one, even if no one wrote it down, and the team that writes it down stops shipping confused agents.
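One way to picture the budget and its decision rule is the sketch below. The class name, its fields, and the expected-cost comparison are illustrative assumptions, not a definition taken from this post: the agent asks only while allowance remains and only when the expected cost of guessing wrong exceeds the cost of the question.

```python
from dataclasses import dataclass


@dataclass
class ClarificationBudget:
    """Per-task allowance for questions (all fields are illustrative assumptions)."""

    max_questions: int        # how many questions this task may spend
    question_cost: float      # friction of asking once, in arbitrary units
    wrong_answer_cost: float  # cost of the user redoing a wrong output
    questions_asked: int = 0

    def should_ask(self, p_wrong_if_guess: float) -> bool:
        """Ask only if budget remains and guessing is expected to cost more."""
        if self.questions_asked >= self.max_questions:
            return False
        return p_wrong_if_guess * self.wrong_answer_cost > self.question_cost

    def record_question(self) -> None:
        """Spend one unit of the budget after the agent asks."""
        self.questions_asked += 1


budget = ClarificationBudget(max_questions=1, question_cost=1.0, wrong_answer_cost=5.0)
print(budget.should_ask(0.1))  # low ambiguity: guessing is cheaper than asking
print(budget.should_ask(0.5))  # high ambiguity: the question is worth its cost
budget.record_question()
print(budget.should_ask(0.5))  # budget spent: the agent must commit
```

The two failure modes described above fall out of the settings: a `question_cost` near zero yields the agent that interrogates, and a very high one yields the agent that never asks.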

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.
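A rough sketch of what such a comment could contain follows. The function name, the metric names, the threshold, and the assumption that higher scores are better are all hypothetical; actually posting the text to a pull request is left to whatever CI tooling a team already uses.

```python
def render_eval_comment(
    baseline: dict[str, float],
    candidate: dict[str, float],
    regression_threshold: float = 0.02,  # assumed tolerance, tune per metric
) -> str:
    """Build a PR comment body: regression callout first, then per-metric deltas.

    Assumes every metric is "higher is better".
    """
    rows = ["| metric | base | PR | delta |", "|---|---|---|---|"]
    regressions = []
    for name in sorted(baseline):
        if name not in candidate:
            continue  # metric missing from the PR run; skipped in this sketch
        delta = candidate[name] - baseline[name]
        rows.append(
            f"| {name} | {baseline[name]:.3f} | {candidate[name]:.3f} | {delta:+.3f} |"
        )
        if delta < -regression_threshold:
            regressions.append(name)

    if regressions:
        callout = "REGRESSION: " + ", ".join(regressions)
    else:
        callout = "No regressions beyond threshold"
    # The callout goes on the first line so the reviewer cannot scroll past it.
    return "\n".join([callout, "", *rows])


comment = render_eval_comment(
    baseline={"accuracy": 0.90, "groundedness": 0.80},
    candidate={"accuracy": 0.80, "groundedness": 0.85},
)
print(comment)
```

Putting the callout above the table is the design choice that matters: the reviewer sees the regression before the diff, not two weeks later in a ticket.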

Tool Call Ordering Is a Partial Order, Not a Set

· 10 min read
Tian Pan
Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
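Declaring the order can be as small as a set of "must precede" pairs that a planned tool sequence is checked against before anything executes. The sketch below assumes hypothetical tool names and a hypothetical `ordering_violations` helper, and it simplifies by looking only at the first occurrence of each tool in the plan.

```python
# Hypothetical constraint set: each pair (a, b) means "a must run before b".
MUST_PRECEDE = {("create_entity", "notify_webhook")}


def ordering_violations(
    plan: list[str],
    must_precede: set[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return the declared constraints that a planned tool sequence breaks.

    A constraint (a, b) is broken when b appears in the plan but a is missing
    or appears later. Tools with no constraint between them may run in any order.
    """
    position: dict[str, int] = {}
    for index, tool in enumerate(plan):
        position.setdefault(tool, index)  # keep the first occurrence only

    violations = []
    for before, after in sorted(must_precede):
        if after not in position:
            continue  # the dependent tool is not planned, nothing to enforce
        if before not in position or position[before] > position[after]:
            violations.append((before, after))
    return violations


print(ordering_violations(["create_entity", "notify_webhook"], MUST_PRECEDE))
print(ordering_violations(["notify_webhook", "create_entity"], MUST_PRECEDE))
```

The first plan returns an empty list and the second returns the broken constraint, so the "notify then create" order is rejected at planning time instead of surfacing as a 404 in the consumer.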

The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor

· 11 min read
Tian Pan
Software Engineer

There is a quiet myth in AI engineering that goes like this: a "minor" model bump (claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date) should be a drop-in upgrade. The provider's release notes talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.

Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.

Your Prompts Ship Like Cowboys: Why Code Review Discipline Doesn't Extend to AI Artifacts

· 11 min read
Tian Pan
Software Engineer

Walk through any mature engineering team's PR queue and you will see the same thing: a four-line bug fix attracts three rounds of comments about naming, error handling, and missing test coverage, while a forty-line edit to the system prompt sails through with a single "LGTM, ship it." The author shrugged because the diff looks like documentation. The reviewer shrugged because they have no mental model of what "good" looks like inside that block of English. The result is a prompt change with the blast radius of a feature launch, reviewed at the bar of a typo fix.

This is the quiet quality crisis of every team building with LLMs in production. The codebase has decades of accumulated discipline: linters, type checks, code owners, test gates, deploy windows. The artifacts that actually steer the model (the system prompt, the eval rubric, the tool description, the few-shot exemplars) sit in the same repo but ship through a review process that treats them as English prose. So prompt regressions, eval-rubric drift, and tool-schema breakages land at a quality bar the team would never accept for code.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

The Eval Bottleneck: Your Eval Engineer Is Now the Roadmap

· 11 min read
Tian Pan
Software Engineer

The constraint on your AI roadmap isn't GPU capacity, model availability, or prompt-engineering taste. It's the calendar of one or two engineers who actually know how to build an eval that catches a regression. Every PM with a feature is in their queue. Every model upgrade is in their queue. Every cohort drift, every prompt revision, every "is this judge still calibrated" question lands in the same inbox. And the engineer in question said "no, this isn't ready" three times this quarter, got overruled twice, watched the regression compound in production, and is now updating their LinkedIn.

This is the eval bottleneck, and most orgs don't see it until it bites. Through 2025 the visible scaling story was AI engineers — hire AI engineers, ship AI features, iterate on prompts, swap models. By Q1 2026 the throughput problem moved one layer down. The team that doubled its AI headcount discovered that adding more feature engineers didn't make features ship faster, because every feature still needed an eval, and the eval engineer was the same person.