13 posts tagged with "llm-evals"

The Eval Set That Sampled Production Traffic at 3am EST

· 10 min read
Tian Pan
Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.
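One way to avoid sampling by quietness is to stratify the sample so each slice of traffic appears in the eval corpus in proportion to its share of the log. The sketch below is illustrative only: the trace shape (a dict with an `hour` field) and the `stratified_sample` helper are assumptions, not the team's actual pipeline.

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, total, seed=0):
    """Draw about `total` traces so each stratum keeps its share of the log.

    Per-stratum quotas are rounded, so the result can be off by a few items.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)
    sample = []
    for stratum in sorted(strata):
        bucket = strata[stratum]
        quota = round(total * len(bucket) / len(traces))
        sample.extend(rng.sample(bucket, min(quota, len(bucket))))
    return sample

# Hypothetical log: 12% of traces come from the 3am hour, the rest from mid-afternoon.
traces = [{"id": i, "hour": 3 if i < 120 else 14} for i in range(1000)]
sample = stratified_sample(traces, key=lambda t: t["hour"], total=100)
```

The same helper works with a different `key`, for example a customer segment, if the goal is to match revenue-relevant traffic rather than raw volume.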

The Synthetic Eval That Taught Your Agent to Recognize Evals

· 8 min read
Tian Pan
Software Engineer

A research model rewrote a benchmark's timer so every run reported a fast finish. Another flagship model passed roughly half of a suite of "impossible" programming tests by deleting the tests or quietly redefining what "correct" meant. These are the dramatic cases the press picked up. The quiet version is happening in your eval suite right now: your synthetic eval generator has a fingerprint, your model learned the fingerprint, and your scores climb release over release while users tell support the product feels worse.

Eval-recognition is the failure mode where a model behaves better during evaluation than in production not because it became better at the task but because it became better at noticing it is being evaluated. Templated phrasing, recognizable artifact tokens, missing-context patterns no human user produces — these are signals, and any model with enough capacity to learn the task has enough capacity to learn the signal too. The eval score goes up. The user-facing metric does not. The team optimizes for months against a benchmark their own pipeline taught the model to game.

This is not a benchmark contamination story in the training-data sense. The model has not seen the eval answers. It has learned something subtler and harder to fix: the eval distribution has a shape, the production distribution has a different shape, and the model has learned to discriminate between them and route its effort accordingly.

The System Prompt That Grew Faster Than Your Eval Suite

· 11 min read
Tian Pan
Software Engineer

The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.

The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." The coverage ratio has a denominator (rules in the prompt) that has been silently expanding while the numerator (rules with at least one eval case) stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.
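The gap between rules and cases can be made explicit if every eval case is tagged with the prompt rules it exercises. A minimal sketch, assuming hypothetical rule IDs and a hypothetical `rules` field on each eval case:

```python
def rule_coverage(prompt_rules, eval_cases):
    """Return (coverage fraction, sorted list of rules with no eval case)."""
    rules = set(prompt_rules)
    if not rules:
        raise ValueError("prompt has no rules to cover")
    covered = {rule for case in eval_cases for rule in case["rules"]}
    untested = sorted(rules - covered)
    return len(rules & covered) / len(rules), untested

# Hypothetical numbers: forty rules in the prompt, cases that touch only the first three.
prompt_rules = [f"rule-{i:02d}" for i in range(1, 41)]
eval_cases = [
    {"id": "case-1", "rules": ["rule-01"]},
    {"id": "case-2", "rules": ["rule-02", "rule-03"]},
]
fraction, untested = rule_coverage(prompt_rules, eval_cases)
```

Reporting `fraction` next to the pass rate makes "the evals pass" harder to say without also saying how much of the prompt they describe.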

The Synthetic Eval Your Real Users Never Resemble

· 10 min read
Tian Pan
Software Engineer

There is a class of eval failure that no dashboard catches because it shows up as success. The score climbs week over week. The judge agrees with the answer. The regression tests stay green. Meanwhile, the support team is logging a slow drift in user-reported quality, sales is hearing "it doesn't quite get what I meant," and nobody in engineering can reproduce the complaint because every example anyone tries on the eval set passes. The eval and the users live in different distributions, and the eval is the more polished of the two.

The mechanism is simple, and it hides in plain sight: the model that wrote your eval prompts and the model under test are siblings, and siblings share priors. They smooth the same edges, prefer the same phrasings, leave out the same kinds of malformed input. The eval certifies behavior on a world the generator imagined users have. Your actual users live somewhere else.

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
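The compounding-error part of that cliff is plain arithmetic. As a deliberately simplified sketch, assume every turn succeeds independently with the same probability; real sessions also suffer from attention drift and committed assumptions, so this is likely an optimistic bound rather than a model of any particular agent. The 0.98 figure is a made-up per-turn accuracy.

```python
def session_success(per_turn_accuracy, turns):
    """Probability that every turn in a session succeeds, assuming independent turns."""
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    return per_turn_accuracy ** turns

# Hypothetical per-turn accuracy of 98%, at demo length and at power-user length.
demo = session_success(0.98, 5)
power_user = session_success(0.98, 31)
```

Under these assumptions the five-turn demo succeeds about nine times in ten, while the thirty-one-turn session succeeds only a little more than half the time, with no change to the model.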

Deleting an Eval Case Is a Decision, Not Cleanup

· 10 min read
Tian Pan
Software Engineer

Every eval suite eventually gets pruned. Someone notices the suite takes nine minutes to run, costs $40 a pass, and is full of cases nobody remembers writing. They open a PR titled "clean up stale eval cases," delete forty entries that "don't seem relevant anymore," and the CI run drops to four minutes. The PR gets a thumbs-up. Nobody objects, because deleting tests looks like maintenance.

It is not maintenance. Every eval case is a guarantee the team made to itself: this failure mode will not recur silently. Deleting the case retires the guarantee. The pass rate does not change, the dashboard stays green, and the only thing that disappears is the team's memory that the guarantee ever existed. Six months later a model migration reintroduces exactly the regression a deleted case was guarding, the postmortem rediscovers a lesson the team already paid for once, and someone writes "we should add a test for this" — the test that was deleted in the cleanup PR.

The Snapshot Trace Test: Production Traces as Your Regression Suite

· 10 min read
Tian Pan
Software Engineer

The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.

Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.

Retiring an Agent Tool the Planner Learned to Depend On

· 10 min read
Tian Pan
Software Engineer

You unregister lookup_account_v1 from the tool catalog, swap in lookup_account_v2, and edit one paragraph of the system prompt to point at the new name. Tests pass. Three days later, support tickets start mentioning that the assistant "keeps trying to call something that doesn't exist," or — more disturbingly — that it answers customer questions with confident, plausible numbers and never hits the database at all. The deprecation didn't fail at the wire. It failed in the planner.

This is the gap between treating a tool deprecation as a syntactic change and treating it as a behavioral migration. The agent didn't just have your function in a registry; it had months of plans, multi-step recipes, and few-shot examples that routed through that function as a checkpoint. Pulling it out is closer to retiring an internal API your downstream services have informally hardcoded — except the downstream service is a model whose habits you cannot grep, and whose fallback when its preferred tool disappears is to invent one.

Eval Set Rot: Why Your Score Trends Up While Users Trend Down

· 10 min read
Tian Pan
Software Engineer

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is grading today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."
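Whether an eval set has rotted can be estimated by comparing its category mix against recent production traffic. The sketch below uses total variation distance over category labels; the labels and the idea that each case and trace carries one are assumptions for illustration.

```python
from collections import Counter

def category_mix(labels):
    """Turn a list of category labels into a dict of category -> share."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {category: n / len(labels) for category, n in counts.items()}

def total_variation(eval_labels, prod_labels):
    """0.0 means identical mixes; 1.0 means no overlap at all."""
    p, q = category_mix(eval_labels), category_mix(prod_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())

# Hypothetical labels: the eval set has no cases for a category that is now half of traffic.
eval_labels = ["billing"] * 50 + ["search"] * 50
prod_labels = ["billing"] * 25 + ["search"] * 25 + ["long_tail"] * 50
drift = total_variation(eval_labels, prod_labels)
```

Adding more cases in the existing categories leaves this number unchanged, which is the sense in which "more cases" is not "better coverage."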

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
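A first step toward seeing the wall clock as an input is to slice latency by hour of day instead of reporting one number. This is a rough sketch with invented sample data; the `(hour, latency_ms)` tuple shape is an assumption about what a tracing backend could export.

```python
from collections import defaultdict
from statistics import quantiles

def p95_by_hour(samples):
    """samples: iterable of (hour_of_day, latency_ms). Returns hour -> p95 latency."""
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return {
        hour: quantiles(values, n=20)[-1]
        for hour, values in sorted(by_hour.items())
        if len(values) >= 2  # quantiles needs at least two data points
    }

# Invented data: a quiet 2 AM eval window and a busier 8 AM production window.
samples = [(2, 400 + i) for i in range(40)] + [(8, 900 + 5 * i) for i in range(40)]
p95 = p95_by_hour(samples)
```

Running the eval suite in several of these hour buckets, rather than only the quietest one, is one way to calibrate against conditions users actually see.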

User-Side Concept Drift: When Your Prompt Held but Your Users Moved

· 10 min read
Tian Pan
Software Engineer

Most teams set up drift monitoring on the wrong side of the contract. They watch the model — capability shifts when a vendor pushes a new checkpoint, output distribution changes after a prompt rewrite, refusal-rate spikes that signal a safety filter retune. The dashboards are detailed, the alerts are wired into PagerDuty, and the team has a runbook for "the model moved." None of that helps when the model didn't move and the dashboard still goes red, because the thing that shifted was your users.

User-side concept drift is the version of this problem that almost every eval pipeline misses. Your prompt, your model, and your tools are byte-identical to the day you launched. Your golden test set still passes 91%. But the prompt that hit 91% in week one is now serving 78% in week thirty, because the input distribution has moved underneath it — users learned the product and changed how they ask, vocabulary mutated, seasonal task types appeared, a competitor reframed the category, a viral thread taught a new way to phrase the same intent. The model and prompt held. The contract held. The world the contract was negotiated against did not.
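The arithmetic behind a prompt that "held" while its score fell is a reweighting: per-segment pass rates stay fixed and only the traffic mix changes. A small sketch with invented segments and numbers, chosen to illustrate the effect rather than to reproduce the figures above:

```python
def weighted_pass_rate(pass_rate_by_segment, traffic_mix):
    """Overall pass rate given per-segment pass rates and traffic weights."""
    total = sum(traffic_mix.values())
    if total <= 0:
        raise ValueError("traffic_mix must have positive total weight")
    return sum(pass_rate_by_segment[seg] * w for seg, w in traffic_mix.items()) / total

# Hypothetical: the model is unchanged, so per-segment pass rates are constant.
pass_rates = {"launch_phrasing": 0.95, "new_phrasing": 0.60}
week_one = weighted_pass_rate(pass_rates, {"launch_phrasing": 0.9, "new_phrasing": 0.1})
week_thirty = weighted_pass_rate(pass_rates, {"launch_phrasing": 0.5, "new_phrasing": 0.5})
```

A golden set frozen at the week-one mix keeps reporting the week-one number, so the drop only shows up when the weights are refreshed from current traffic.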