Skip to main content

160 posts tagged with "evaluation"

View all tags

The Redaction Layer Your Agent Cannot Reason Through

· 9 min read
Tian Pan
Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.

The Judge That Agreed With Itself Across A and B

· 10 min read
Tian Pan
Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.

Your Eval Set Only Has Problems You Already Solved

· 9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.

Your Model Router Is a Load Balancer That Cannot See the Load

· 11 min read
Tian Pan
Software Engineer

A load balancer in front of a web fleet works because every machine reports back: CPU, queue depth, error rate, latency. The balancer reads the load and routes accordingly. A model router does not get that telemetry. It decides which model handles a query by looking only at the query, before the model has done anything. The router predicts difficulty from the prompt. Real difficulty only shows up in the answer. By the time the signal exists, the routing decision is already three seconds old and the cheap model has already shipped a confident, wrong reply to your user.

This is the structural defect at the center of model routing, and most teams ship a router without ever framing it this way. They frame it as a classifier — train a model to label queries as "easy" or "hard," validate it on a held-out set, ship when accuracy clears 90%. The classifier metaphor is wrong in a way that matters. A classifier predicts a label that already exists. The router is predicting a label that does not exist yet, will not exist until the routed model has answered, and may never exist in a form clean enough to learn from.

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: Output strictly valid JSON became Always respond using clean, parseable JSON. It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.

Hiring for AI Roles That Have No Career Ladder Yet

· 9 min read
Tian Pan
Software Engineer

You open a requisition for an "eval engineer." A week later your recruiter asks the obvious question: what level is this, and what does a good resume look like? You don't have an answer. The title didn't exist two years ago. There is no leveling rubric, no canonical interview loop, no pool of people with the words "eval engineer" already on their LinkedIn. You are hiring for a job the industry has not agreed exists.

This is the quiet bottleneck in shipping AI systems. The model is available. The infrastructure is rentable. What you cannot buy off the shelf is the person whose actual job is keeping a prompt-driven system honest — and your hiring machinery, built for roles with decades of precedent, has no slot for them.

The instinct is to wait. Wait for the title to standardize, for the bootcamps to mint candidates, for someone else to write the leveling guide you can copy. That instinct is wrong. The work exists now whether or not the title does, and the teams staffing it now are the ones learning what "good" looks like before their competitors even open the req.

The Eval Set That Got Easier While You Weren't Looking

· 9 min read
Tian Pan
Software Engineer

You wrote the eval set eighteen months ago. Back then it was a useful instrument: the cheap model scored 71%, the better one scored 84%, and when a regression slipped in the number dropped and somebody noticed. The suite earned its place in CI. You stopped thinking about it.

Run it today and every candidate model scores 96, 97, 98. The new release scores the same as the old one. The model you suspect is worse scores the same as the model you suspect is better. The number still renders green in the dashboard, the check still passes, and it tells you exactly nothing. Your eval set didn't break. It got easier — because the models got better underneath it — and nobody was watching the moment it stopped discriminating.

This is eval saturation, and it is not a failure mode you might hit. It is the guaranteed end state of any static suite given a long enough timeline. A test that everything passes has stopped being a test.

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

· 8 min read
Tian Pan
Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

The Eval Suite That Became the Spec Nobody Agreed To

· 8 min read
Tian Pan
Software Engineer

Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?

For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.

This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was assembled by one engineer, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized precisely to it.

The Happy Path Is the Only Path Your Agent Eval Ever Tested

· 10 min read
Tian Pan
Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.

The Model Reached End of Life and Took Your Prompt With It

· 10 min read
Tian Pan
Software Engineer

A deprecation notice looks harmless. It arrives as a calm paragraph in a changelog or an email: this model snapshot will be removed from the API on a date a few months out, here is the recommended replacement, thank you for building with us. The implied work is a one-line change — swap the model string, redeploy, done.

That framing is wrong, and it is wrong in an expensive way. The model string is the smallest thing you are losing. The thing that actually leaves with the old model is the prompt you spent six months tuning — every edge-case patch, every reordered instruction, every "respond only with valid JSON, do not wrap it in markdown" you added because that specific model did that specific annoying thing. None of that was portable. It was fitted, in the statistical sense, to one model's behavior. The replacement is not bug-for-bug compatible, so the fit no longer holds.

A model end-of-life is a migration project. Treat it as a config change and you will discover the difference in production, on the new model, with real traffic.