Skip to main content

28 posts tagged with "llm-evaluation"

View all tags

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.

The Success Metric That Improved Because the Model Declined the Hard Cases

· 9 min read
Tian Pan
Software Engineer

You bumped the model on Tuesday. By Friday, the "task completion rate" dashboard had climbed from 71% to 78%. Leadership noticed. Someone screenshotted it for the all-hands. Two weeks later, support quietly flagged that churn on a specific cohort of complex tickets had doubled. Nobody connected the two events because, on paper, the agent got better. In reality, the new model just got better at refusing.

This is the metric decoupling problem, and it is one of the most expensive ways an LLM-powered product can deceive its own builders. Your success rate did not measure what you thought it measured. It measured the intersection of what the model attempted and what the model got right when it attempted. When a model upgrade, a prompt change, or a safety-tuning pass shifts the boundary of "attempted," your numerator and your denominator move together — and the ratio can go up even as user-perceived quality falls off a cliff.

Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded by an eval pipeline that scores each turn in isolation and a satisfaction survey that only fires at session end — by which point the user who re-asked three times has usually already churned and will never grade anything.

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.

The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.

When LLMs Review LLMs, Errors Get Laundered Not Caught

· 10 min read
Tian Pan
Software Engineer

Trace the path of a single quality signal through a modern AI pipeline. An agent drafts a response. A second model reviews it and scores it 9 out of 10. That score gets logged. At the end of the quarter, the logged scores become the new eval set, and the next model is tuned to do well against it. Now ask the obvious question: where in that loop did a human ever look at the actual output?

In a lot of pipelines, the honest answer is nowhere. The agent that does the work is reviewed by another agent, and that reviewer's verdict feeds the next round of evaluation. The loop is closed. It runs continuously, it produces a dashboard, and the dashboard is green. What it does not contain, at any point, is a measurement against reality.

The Unhelpful-but-Safe Failure: When Refusal Rate Is the Wrong Safety Metric

· 10 min read
Tian Pan
Software Engineer

There is a class of LLM failure that does not show up on a safety dashboard and does not generate an incident ticket. The model declines politely. It cites a reasonable-sounding policy. It offers a four-paragraph hedge instead of an answer. The user closes the tab. The trust score in the postmortem reads "no incident." The retention chart, six weeks later, says otherwise.

Refusal rate is the metric most safety teams instrument first because it is the easiest to define. A model either complied or did not, and you can count the "did nots." That binary is useful for catching one specific failure — a model producing harmful content in production. It is structurally incapable of catching the opposite failure: a model producing nothing useful in production while looking, by every safety measurement, perfectly behaved. This second failure is now the dominant source of churn for AI features that were shipped through a safety review and never instrumented for usefulness.

Forced Conformance Bias: When the Model Rounds Your Intent to the Distribution Mode

· 10 min read
Tian Pan
Software Engineer

A user asks for "a haiku about Postgres replication." The model returns a five-line poem about databases that mentions servers and synchronization, sounds confident, scans like English, and is not a haiku. A different user asks for "a regex that matches IPv6 addresses but explicitly rejects IPv4-mapped forms." The model returns a regex that matches IPv6 addresses, including the IPv4-mapped forms it was told to reject, and asserts in prose that the regex meets the spec. A third user asks for "an explanation of monads using only cooking metaphors, no mention of functions or types." The model gives a mostly-cooking explanation that uses the words "function" twice and "type" three times.

None of these is a refusal. None is an obvious hallucination. The model didn't say "I can't do that." It produced a confident, well-formed response that quietly relaxed the part of the request furthest from its training distribution mode, and the user has to be paying close attention to notice. The failure mode has a name worth using: forced conformance bias — the model rounds your intent toward the typical answer, the user reads the result as a faithful response, and the eval suite that should have caught it was itself drawn from typical phrasings.

This is not a model quality problem in the usual sense. The model is doing exactly what its training pushed it toward. It is a product reliability problem, and the team whose evals live at the mode of intent distribution is calibrating against the easy half of their actual workload.

Ship Your AI Feature Before It Feels Ready

· 9 min read
Tian Pan
Software Engineer

Most AI features that ship late don't ship late because they're broken. They ship late because the team is still optimizing for a test suite that doesn't reflect how real users behave. The benchmarks look better each week. The evals trend upward. And the gap between "lab performance" and "production value" quietly widens.

The uncomfortable truth is that the first 500 real users will surface more actionable problems in two weeks than four more weeks of prompt tuning ever could. This is not an argument for shipping garbage. It's an argument for recognizing that your current calibration of "ready" is almost certainly miscalibrated — and that real usage data is the only thing that corrects it.

When LLMs Grade Their Own Homework: The Feedback Loops Breaking AI Evaluation

· 10 min read
Tian Pan
Software Engineer

Here is a finding most AI teams don't want to sit with: in a large-scale study that generated over 150,000 evaluation instances across 22 tasks, roughly 40% of LLM-as-judge comparisons showed measurable bias. That bias wasn't random noise—it was systematic, reproducible, and correlated with how models were trained. When you use a model to generate your eval set and then use the same model (or a close relative) to grade it, you're not measuring quality. You're measuring how well a system agrees with itself.

Synthetic eval data has become standard practice for good reasons. Human annotation is slow, expensive, and hard to scale. LLM-generated test cases let teams spin up thousands of examples overnight. The problem surfaces when the generator and the judge share a common ancestor—which, in 2025, is almost always the case. The result is an eval pipeline that confidently reports high scores while hiding the exact failure modes you built it to catch.

The Eval Debt Ratchet: How Teams Get Buried Cleaning Up What They Shipped on Vibes

· 10 min read
Tian Pan
Software Engineer

Three months after shipping a document summarization feature, a team at a mid-size company runs a prompt improvement. The new prompt scores better on the five examples they tested manually. They deploy it Friday afternoon. Monday morning, their Slack is full of user reports: summaries are now truncating half the document and presenting the truncated version as complete. The feature looked fine. The change passed review. Nobody noticed because there was no evaluation — no golden test set, no regression baseline, no automated check. The ratchet had been turning silently for months.

This is eval debt in its most recognizable form. The team didn't skip evaluations because they were careless. They skipped them because writing evaluations for AI features is harder than it sounds, the feature shipped fast and looked good, and nobody wanted to slow down a team with momentum. Now they're paying the compound interest.