32 posts tagged with "llm-evaluation"

The A/B Test Winner Whose Verbose Output Triggered Your Click Handler More Than the Better Answer

· 10 min read
Tian Pan
Software Engineer

A prompt-variant experiment runs on the production traffic of an AI-assisted search product. The success metric is a click on any suggested action in the response. Variant B ships responses that are roughly forty percent longer, with more enumerated options. Its click-through rate is eleven percent higher, significant at the 99.9 percent level. The experiment is declared a winner and shipped.

A month later, the weekly customer satisfaction survey drops two points. Nobody connects it to the launch because the experiment has already been written up as a success and the team has moved on. A quarterly review eventually traces the satisfaction drop back to the prompt change, and the diagnosis lands hard: variant B won not because it gave users better answers but because longer answers contained more clickable surfaces. The click handler fired more often per impression because there was more to click, not because what the user read was more worth acting on.
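One way to see the confound is to report clicks per clickable surface next to clicks per impression. The sketch below is illustrative only: the `Impression` record, its field names, and the data are made up for the example and are not taken from the experiment described above.

```python
from dataclasses import dataclass


@dataclass
class Impression:
    variant: str
    clickable_surfaces: int  # suggested actions rendered in the response
    clicks: int              # clicks recorded on this impression


def click_rates(impressions, variant):
    """Return (share of impressions with a click, clicks per clickable surface)."""
    rows = [i for i in impressions if i.variant == variant]
    clicked = sum(1 for i in rows if i.clicks > 0)
    per_impression = clicked / len(rows)
    per_surface = sum(i.clicks for i in rows) / sum(i.clickable_surfaces for i in rows)
    return per_impression, per_surface


# Made-up data: variant B renders twice as many clickable actions per response.
log = (
    [Impression("A", 3, c) for c in (1, 0, 0, 1)]
    + [Impression("B", 6, c) for c in (1, 1, 0, 1)]
)
a_imp, a_surf = click_rates(log, "A")
b_imp, b_surf = click_rates(log, "B")
print(f"A: {a_imp:.2f} per impression, {a_surf:.3f} per surface")
print(f"B: {b_imp:.2f} per impression, {b_surf:.3f} per surface")
```

In this toy log, B wins on the per-impression metric and loses on the per-surface one, which is the pattern the diagnosis describes. Normalizing by surface count is only one possible guardrail; it does not by itself show which answer was more useful.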

The Thumbs-Up Button That Poisoned Your Eval Set Through the Back Door

· 11 min read
Tian Pan
Software Engineer

A thumbs-up button is the cheapest signal you will ever instrument. It is also one of the most dangerous, because nothing about it announces that it is reshaping the distribution your eval set is supposed to represent. The button is collected as a positive — the curation pipeline reads it as quality — and six months later the eval is dominated by examples chosen by a cohort that does not include the customers most likely to churn.

The failure rarely shows up as a regression. It shows up as a divergence: weekly eval trends up, the enterprise tier's NPS slides, and the team only diagnoses the gap when a churned account names the specific kind of question their team kept getting wrong. The eval set has no examples shaped like it. The signal you were optimizing was real. It was just measuring the wrong distribution.

The Deterministic Seed Your Eval Suite Set That Your Provider Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

You set seed=42. You set temperature=0. You logged the run, posted the dashboard, signed off on the model swap. The next morning the rerun returned a different number on the same prompts, and the explanation you reached for — "must be sampling noise" — was wrong twice over: there was no sampling, and the noise was structural. The seed left your client, the gateway threw it away, the kernel batched your request next to seventeen unrelated ones, and the floating-point reduction order changed under you. Your "reproducible" benchmark was always within one batch of being a different benchmark.

This failure mode is quiet because every layer in the stack is technically correct. The SDK accepts the seed. The provider documents the seed. The model returns a system_fingerprint. The eval harness logs all three. Nothing 5xx's, nothing warns, nothing protests. The number on the dashboard just shifts, and the team rationalizes the shift as the kind of jitter that always existed — because they have no instrument that can tell them whether they're looking at stochastic decoding or at a backend rotation that invalidated three weeks of comparisons.
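A minimal sketch of the missing instrument, assuming the harness already stores each run as a dict with a `system_fingerprint` and a `score`. The record shape, the helper names, and the fingerprint values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def scores_by_fingerprint(runs):
    """Average score per backend fingerprint, so shifts can be attributed."""
    groups = defaultdict(list)
    for run in runs:
        # A missing fingerprint is grouped under "unknown" rather than dropped.
        groups[run.get("system_fingerprint") or "unknown"].append(run["score"])
    return {fp: mean(scores) for fp, scores in groups.items()}


def backend_changed(baseline_runs, candidate_runs):
    """True when the two sets of runs were not served by the same fingerprints."""
    base = {r.get("system_fingerprint") for r in baseline_runs}
    cand = {r.get("system_fingerprint") for r in candidate_runs}
    return base != cand


# Hypothetical records: same prompts, same seed, different days.
monday = [
    {"system_fingerprint": "fp_a", "score": 0.82},
    {"system_fingerprint": "fp_a", "score": 0.82},
]
tuesday = [{"system_fingerprint": "fp_b", "score": 0.79}]

rotated = backend_changed(monday, tuesday)
per_backend = scores_by_fingerprint(monday + tuesday)
print(rotated, per_backend)
```

A changed fingerprint does not prove the score moved because of the backend, and an unchanged one does not rule out batching effects. It only separates the comparisons that are at least nominally like-for-like from the ones that are not.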

The Shadow Deploy That Proved Nothing: When Parallel Calls Miss the Conversation

· 9 min read
Tian Pan
Software Engineer

A shadow deployment is the validation everyone agrees is responsible. You mirror live traffic into a candidate model, log its output, never show the result to the user. The dashboards line up, the candidate's responses look as good as the incumbent's on aggregate quality metrics, the team gets a green signal that the new model is "production-equivalent," and you promote it to a small slice of real traffic. Within a day, user-facing metrics collapse on a class of queries the shadow run had rated as matched.

The team's first instinct is to blame the rollout: maybe a feature flag misfired, maybe a router routed wrong, maybe the new model is silently degraded in production in a way it wasn't in shadow. None of those are true. The shadow worked exactly as designed. What the team measured was the candidate model's output in isolation — a string against a string — and what got promoted was a candidate model whose output reshapes the next user message, the next turn, the abandonment decision, and the path through the rest of the session. The shadow measured the model. Production measures the conversation. Those are not the same unit.

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
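Naming the boundary can start with a check that fails the build when the two pools overlap. The following is a sketch under simple assumptions: both pools are lists of question strings, and exact match after whitespace and case normalization is enough to catch the shared-CSV case. The function names and sample questions are hypothetical; near-duplicate detection would need something stronger.

```python
def normalize(question):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(question.lower().split())


def leaked_questions(few_shot_pool, eval_set):
    """Return few-shot questions that also appear in the eval set."""
    eval_keys = {normalize(q) for q in eval_set}
    return [q for q in few_shot_pool if normalize(q) in eval_keys]


# Hypothetical pools that accidentally share one row.
few_shot_pool = ["How do I rotate an API key?", "What is the rate limit?"]
eval_set = ["what is the  rate limit?", "How do I delete my account?"]

leaks = leaked_questions(few_shot_pool, eval_set)
print(leaks)
```

Running a check like this in CI, before the evaluator reads anything, turns an invisible convention into an explicit failure.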

The Success Metric That Improved Because the Model Declined the Hard Cases

· 9 min read
Tian Pan
Software Engineer

You bumped the model on Tuesday. By Friday, the "task completion rate" dashboard had climbed from 71% to 78%. Leadership noticed. Someone screenshotted it for the all-hands. Two weeks later, support quietly flagged that churn on a specific cohort of complex tickets had doubled. Nobody connected the two events because, on paper, the agent got better. In reality, the new model just got better at refusing.

This is the metric decoupling problem, and it is one of the most expensive ways an LLM-powered product can deceive its own builders. Your success rate did not measure what you thought it measured. It measured how often the model got the task right among the cases it chose to attempt. When a model upgrade, a prompt change, or a safety-tuning pass shifts the boundary of "attempted," your numerator and your denominator move together, and the ratio can go up even as user-perceived quality falls off a cliff.
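A small worked example makes the arithmetic concrete. The 71% and 78% figures come from the scenario above; the ticket counts are invented so the numbers land on them, and `completion_rates` is a hypothetical helper.

```python
def completion_rates(total_tasks, attempted, completed):
    """Return (rate among attempted tasks, rate among all incoming tasks)."""
    return completed / attempted, completed / total_tasks


# Old model: attempts every one of 1000 tickets and completes 710.
old_attempted_rate, old_overall_rate = completion_rates(1000, 1000, 710)

# New model: declines 200 hard tickets, completes 624 of the 800 it attempts.
new_attempted_rate, new_overall_rate = completion_rates(1000, 800, 624)

print(f"old: {old_attempted_rate:.1%} of attempted, {old_overall_rate:.1%} of all")
print(f"new: {new_attempted_rate:.1%} of attempted, {new_overall_rate:.1%} of all")
```

Under these assumed counts the dashboard metric rises from 71% to 78% while the share of all tickets actually resolved falls from 71% to 62.4%. Reporting both denominators side by side is one way to make a shift in the "attempted" boundary visible.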

Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded by an eval pipeline that scores each turn in isolation and a satisfaction survey that only fires at session end — by which point the user who re-asked three times has usually already churned and will never grade anything.
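A rough sketch of extracting the signal, assuming a session is available as an ordered list of user messages. Token-set Jaccard overlap stands in for "same underlying request", and the 0.5 threshold is an arbitrary illustration. Lexical overlap will miss rephrasings that swap most of the words, so an embedding-based similarity would likely be more robust in practice.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two messages, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def re_ask_rate(user_turns, threshold=0.5):
    """Share of follow-up user turns that closely repeat the previous user turn."""
    if len(user_turns) < 2:
        return 0.0
    pairs = zip(user_turns, user_turns[1:])
    re_asks = sum(1 for prev, cur in pairs if jaccard(prev, cur) >= threshold)
    return re_asks / (len(user_turns) - 1)


# Hypothetical session: the second turn restates the first, the third moves on.
session = [
    "how do I reset my password",
    "how do I reset the password for my account",
    "what is your refund policy",
]
rate = re_ask_rate(session)
print(rate)
```

Because the score is computed from the transcript itself, it is available for sessions where the user never reaches the end-of-session survey.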

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.
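As a sketch of what "signal together with context" could look like in code, the record below carries a failure from capture through triage to an eval case. Every name here (`FailureCase`, `to_eval_case`, the field names, the signal labels) is hypothetical, and the triage step is reduced to a flag and a written expectation.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCase:
    signal: str                # e.g. "thumbs_down", "abandoned", "rephrased"
    user_input: str
    model_output: str
    context: list = field(default_factory=list)  # prior turns, in order
    triaged: bool = False
    expected_behavior: str = ""  # filled in by whoever triages the case


def to_eval_case(case):
    """Promote a triaged failure into a replayable eval case."""
    if not case.triaged or not case.expected_behavior:
        raise ValueError("triage the failure before promoting it to an eval case")
    return {
        "input": case.user_input,
        "context": case.context,
        "expected": case.expected_behavior,
        "source_signal": case.signal,
    }


# Hypothetical captured failure, then the same case after triage.
raw = FailureCase(
    signal="thumbs_down",
    user_input="cancel my order",
    model_output="Here is how to track your order.",
)
raw.triaged = True
raw.expected_behavior = "Explains how to cancel the order."
eval_case = to_eval_case(raw)
print(eval_case)
```

The point of the gate in `to_eval_case` is that an untriaged signal is not yet a test: someone still has to say what the right behavior would have been.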

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.
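One way to measure how far users have walked out of frame is to compare the mix of request types in the eval set with the mix in recent traffic. The sketch below assumes each example already carries an intent label, which is itself a nontrivial assumption, and uses total variation distance as the drift score. The labels and counts are made up.

```python
from collections import Counter


def distribution(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(eval_labels, traffic_labels):
    """0.0 means identical mixes; 1.0 means no overlap at all."""
    p, q = distribution(eval_labels), distribution(traffic_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical labels: the eval set has no "export" questions at all.
eval_labels = ["billing"] * 3 + ["login"]
traffic_labels = ["billing", "login", "export", "export"]

drift = total_variation(eval_labels, traffic_labels)
print(drift)
```

A score like this says nothing about quality on its own. It only flags when a strong eval result is describing a population that no longer resembles current traffic.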

The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.
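A minimal sketch of versioning the judge like an instrument: every score records which judge and rubric produced it, and trend math refuses to mix them. The `JudgedScore` type, the `mean_delta` helper, and the model identifiers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgedScore:
    example_id: str
    score: float
    judge_model: str     # a pinned identifier, not a "-latest" alias
    rubric_version: str


def mean_delta(old_scores, new_scores):
    """Change in mean score, only if one judge and one rubric produced all of it."""
    judges = {(s.judge_model, s.rubric_version) for s in list(old_scores) + list(new_scores)}
    if len(judges) != 1:
        raise ValueError(f"scores come from different judges or rubrics: {sorted(judges)}")
    return mean(s.score for s in new_scores) - mean(s.score for s in old_scores)


# Hypothetical scores from two quarters, all graded by the same pinned judge.
q1 = [JudgedScore("ex1", 7.0, "judge-2024-01", "v3"), JudgedScore("ex2", 8.0, "judge-2024-01", "v3")]
q2 = [JudgedScore("ex1", 8.0, "judge-2024-01", "v3"), JudgedScore("ex2", 9.0, "judge-2024-01", "v3")]
delta = mean_delta(q1, q2)
print(delta)
```

When the judge does have to change, re-scoring a fixed sample with both the old and the new judge is one way to estimate the missing exchange rate before the trendlines are joined.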

When LLMs Review LLMs, Errors Get Laundered Not Caught

· 10 min read
Tian Pan
Software Engineer

Trace the path of a single quality signal through a modern AI pipeline. An agent drafts a response. A second model reviews it and scores it 9 out of 10. That score gets logged. At the end of the quarter, the logged scores become the new eval set, and the next model is tuned to do well against it. Now ask the obvious question: where in that loop did a human ever look at the actual output?

In a lot of pipelines, the honest answer is nowhere. The agent that does the work is reviewed by another agent, and that reviewer's verdict feeds the next round of evaluation. The loop is closed. It runs continuously, it produces a dashboard, and the dashboard is green. What it does not contain, at any point, is a measurement against reality.

The Unhelpful-but-Safe Failure: When Refusal Rate Is the Wrong Safety Metric

· 10 min read
Tian Pan
Software Engineer

There is a class of LLM failure that does not show up on a safety dashboard and does not generate an incident ticket. The model declines politely. It cites a reasonable-sounding policy. It offers a four-paragraph hedge instead of an answer. The user closes the tab. The trust score in the postmortem reads "no incident." The retention chart, six weeks later, says otherwise.

Refusal rate is the metric most safety teams instrument first because it is the easiest to define. A model either complied or did not, and you can count the "did nots." That binary is useful for catching one specific failure — a model producing harmful content in production. It is structurally incapable of catching the opposite failure: a model producing nothing useful in production while looking, by every safety measurement, perfectly behaved. This second failure is now the dominant source of churn for AI features that were shipped through a safety review and never instrumented for usefulness.