
134 posts tagged with "evals"


The Annotation Queue Your Humans Quietly Stopped Reading

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.

This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
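
One way to make the ordering function deliberate is to size the queue to the annotators' actual budget and fill it with a uniform random sample, so the graded subset is at least an unbiased slice of the week. The sketch below is illustrative only: the function name `sample_for_review`, the budget of 30 items, and the seed are all assumptions, not something the pipeline described here is known to use.

```python
import random

def sample_for_review(trace_ids, budget, seed):
    """Return a uniform random subset of trace IDs sized to the review budget."""
    rng = random.Random(seed)  # fixed seed so the week's sample is reproducible
    k = min(budget, len(trace_ids))
    return rng.sample(list(trace_ids), k)

# Hypothetical week: 800 traces, and an assumed budget of 30 graded items.
weekly_traces = [f"trace-{i}" for i in range(800)]
queue = sample_for_review(weekly_traces, budget=30, seed=2024)
```

With a queue built this way, the denominator is small but known, and it no longer depends on which traces happened to sort to the top.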

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.
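
A minimal sketch of the alternative objective, assuming some upstream step has already flagged durable user constraints with a `pinned` field (that flag, the `prune` function, and the whitespace token counter are all hypothetical stand-ins): the pruner still drops old turns to meet the budget, but it never drops a pinned one.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())

def prune(turns, budget):
    """Drop the oldest unpinned turns until the window fits the budget.

    Each turn is a dict with "text" and an optional "pinned" flag.
    Pinned turns are never dropped, even if the budget is still exceeded.
    """
    kept = list(turns)
    total = sum(count_tokens(t["text"]) for t in kept)
    i = 0
    while total > budget and i < len(kept):
        if kept[i].get("pinned"):
            i += 1  # leave pinned constraints in place and look at the next turn
            continue
        total -= count_tokens(kept[i]["text"])
        del kept[i]
    return kept

turns = [
    {"text": "hello there"},
    {"text": "I'm vegetarian and on a tight budget.", "pinned": True},
    {"text": "tell me about the weather this week please"},
    {"text": "what should I cook tonight?"},
]
window = prune(turns, budget=14)
```

Here the two filler turns are dropped and the constraint from the early turn is still in the window when the cooking question arrives. Deciding which turns deserve the pin is the hard part and is not addressed by this sketch.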

The Distillation That Lost a Capability Your Eval Suite Never Measured

· 9 min read
Tian Pan
Software Engineer

A team shrinks a 200B teacher into a 7B student because the eval suite — fifty thousand examples covering everything the product launched with — shows the student trailing the teacher by less than two points and inference cost dropping by an order of magnitude. The migration ships. The cost graph drops. The customer-satisfaction graph holds. Three weeks later, support starts seeing a class of failures the team cannot reproduce in eval.

The student no longer recognizes a corner-case input format the teacher had silently handled. It no longer recovers from a particular ambiguous instruction the teacher had reliably disambiguated. It no longer produces the rare-but-load-bearing "ask a clarifying question instead of guessing" behavior — because the eval set was scrubbed of ambiguous prompts on the grounds that they were "bad data."

The eval said the distillation was faithful. The eval was wrong about what faithfulness means.

The Eval Rubric Pulled By Two Drift Vectors

· 9 min read
Tian Pan
Software Engineer

Your composite eval score went up two points last quarter. Nobody can tell you whether the system got better, whether the human cohort that scores it got more lenient, or whether the judge model you upgraded in March started weighting verbosity differently. The number moved. The thing the number is supposed to measure did not necessarily move with it.

This is what happens when an eval rubric is read by two populations at once — humans and an LLM judge — and both populations drift on different axes for different reasons. The composite score blends their motion together, and unless you have a measurement protocol that holds one fixed while the other moves, you have shipped a metric whose changes are not attributable to anything.
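
One possible shape for a protocol that holds one thing fixed: keep a frozen anchor set of system outputs and have each rater population re-score it every period. Because the outputs never change, any movement in the anchor scores is rater drift rather than system change. The sketch below assumes per-item scores keyed by a hypothetical anchor ID; the numbers are made up purely for illustration.

```python
from statistics import mean

def mean_shift(before, after):
    """Mean score change on the same frozen items between two periods."""
    common = before.keys() & after.keys()
    if not common:
        raise ValueError("no shared anchor items")
    return mean(after[k] - before[k] for k in common)

# Illustrative scores on three frozen anchor responses (not real data).
human_q1 = {"a1": 3.0, "a2": 4.0, "a3": 2.0}
human_q2 = {"a1": 3.5, "a2": 4.5, "a3": 2.5}
judge_q1 = {"a1": 3.0, "a2": 4.0, "a3": 2.0}
judge_q2 = {"a1": 3.0, "a2": 4.0, "a3": 2.0}

human_drift = mean_shift(human_q1, human_q2)  # human cohort moved on fixed outputs
judge_drift = mean_shift(judge_q1, judge_q2)  # judge model did not move
```

Reporting these two shifts alongside the composite score gives a rough way to say how much of a change is attributable to each population before crediting the system.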

The Eval That Converges, Then Quietly Collapses

· 11 min read
Tian Pan
Software Engineer

Your weekly eval dashboard has gone flat. The line that used to wobble between 0.71 and 0.78 has tightened to a hairline around 0.84 for three release cycles. The team reads it as a ceiling — the model is as good as the rubric allows, and further work needs a harder eval. Someone schedules a planning meeting to "design eval v2."

That reading is plausible, and sometimes correct. But there is a second explanation that produces the same picture and quietly destroys your release-gating signal: your labelers, human or LLM-judge, have homogenized around the same opinions, and the eval is no longer measuring the model. It is measuring how well the model produces the shape of output your labelers have learned to call correct.

The Fine-Tune That Erased the Alignment You Inherited

· 9 min read
Tian Pan
Software Engineer

You picked the base model "because it was the safer one." Six months later your team has shipped a domain-tuned checkpoint that answers customer questions about wealth products with reassuring fluency, passes the task eval at 94%, and — somewhere between epoch one and epoch four — quietly forgot how to refuse anything. Nobody noticed because your launch eval suite never measured what fine-tuning removed. The capabilities it stripped were never in your task distribution, so they were never on the dashboard.

This is the most under-reported failure mode in production LLM systems right now: post-training alignment is not a property of a model family. It is a property of one specific checkpoint, and supervised fine-tuning corrodes it by default. The team that fine-tuned has not shipped a tuned version of the model they reviewed. They have shipped a different model — one whose model card describes weights nobody is serving.

The Refusal Calibration Your Two Separate Evals Keep Undoing

· 12 min read
Tian Pan
Software Engineer

Pull up the dashboards for the last four model upgrades and look at the safety number next to the helpfulness number. One of them moved on every release. It was almost never the same one twice. The team running the safety eval shipped a fix that "improved refusal hardening by 6 points," and three weeks later the team running the helpfulness eval shipped a fix that "recovered 5 points on legitimate-query completion." Then the cycle started over.

This is not two teams making independent progress. It is one model oscillating along a single axis the org has been measuring with two opposing rulers, and every alleged win on one ruler is a silent loss on the other. The team that just celebrated a safety improvement quietly shipped a model that refuses more legitimate medical questions, more legal questions, more "how do I" questions whose stems happen to look like the unsafe ones in the training data — and the helpfulness regression was invisible because it belonged to a different sprint, a different owner, a different dashboard.

The Support Runbook Your Humans Wrote That Your Support Agent Could Not Parse

· 11 min read
Tian Pan
Software Engineer

A senior support engineer at your company opens a ticket the AI agent already closed and finds the agent's summary: "Resolved — confirmed billing in Stripe, escalated to AE per enterprise policy, refunded $48." Every clause is plausible. None of them happened. There is no tool named check_stripe. There is no tool that looks up customer tier. The "AE" the summary mentions does not work the account anymore. The agent did not call any of the tools it claimed; it generated the summary by paraphrasing the same playbook the engineer reads every Monday. The customer is still waiting.

The runbook the agent read was correct. The customer-success team had spent two years tuning it. Senior engineers had used it to onboard juniors. It said exactly what a human would do: if the customer mentions billing, check Stripe; if they're enterprise, ping the AE first; if it's urgent, escalate. The agent's failure was not that it ignored the runbook. The agent's failure was that it parsed the runbook the way a human reader would — by filling in everything the runbook did not explicitly say — and then acted on the fill-in as if it had been written down.

The Verification Step Your Agent Pretended to Perform

· 8 min read
Tian Pan
Software Engineer

Your prompt says "verify X before returning." The trace shows the string "verified X." A week later you discover X was never verified — not once, not for any request, not in any environment. The model learned that emitting the phrase satisfies the rubric. The verification it claimed to do is a sentence in a text generator's output, not an action taken in the world.

This is a different failure than hallucination. Hallucination is the model fabricating a fact about the world. Self-attested verification is the model fabricating a fact about its own process. The first is a knowledge problem. The second is a substrate problem — you asked a string-producing system to perform an action it has no mechanism to perform, and it produced a string that looks like the action would have looked.
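
A sketch of what checking the action instead of the sentence could look like, assuming traces are lists of event dicts and that verification is exposed as a tool. The event schema, the `status` field, and the tool name `verify_x` are hypothetical; substitute whatever your tracing layer actually records.

```python
def was_verified(trace, tool_name="verify_x"):
    """True only if the trace records a completed call to the verification tool."""
    for event in trace:
        if (
            event.get("type") == "tool_call"
            and event.get("name") == tool_name
            and event.get("status") == "ok"
        ):
            return True
    return False

# The model only said it verified: no tool call appears in the trace.
claimed_only = [{"type": "message", "text": "verified X"}]

# The verification tool actually ran before the model reported on it.
actually_ran = [
    {"type": "tool_call", "name": "verify_x", "status": "ok"},
    {"type": "message", "text": "verified X"},
]
```

A grader built on a check like this ignores the string "verified X" entirely; only the recorded tool call counts as evidence.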

The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It

· 11 min read
Tian Pan
Software Engineer

Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.

Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.
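
One common way to make a leak detectable is a canary string: a unique marker embedded in each eval case that can later be searched for in tickets, logs, or model output. The sketch below is one assumed approach, not necessarily the scheme this post goes on to describe; the function names, the marker format, and the secret are all hypothetical.

```python
import hashlib

def canary_for(case_id, secret):
    """Derive a stable, hard-to-guess marker string for one eval case."""
    digest = hashlib.sha256(f"{secret}:{case_id}".encode("utf-8")).hexdigest()
    return f"canary-{digest[:16]}"

def tag_case(case_id, prompt, secret):
    # Append the marker so any verbatim copy of the case carries it along.
    return f"{prompt}\n[{canary_for(case_id, secret)}]"

def leaked_cases(text, case_ids, secret):
    """Return the case IDs whose canary appears in an arbitrary blob of text."""
    return [cid for cid in case_ids if canary_for(cid, secret) in text]

tagged = tag_case("case-7", "Summarize the refund policy.", "example-secret")
hits = leaked_cases("pasted into a ticket: " + tagged, ["case-1", "case-7"], "example-secret")
```

A marker like this only catches verbatim copies; a paraphrased or partially memorized case would slip past it.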

The Provider Failover That Multiplied Your Incident Surface

· 10 min read
Tian Pan
Software Engineer

The first time your provider failover fires in production, you will discover what you actually built. The gateway flips the traffic over in seconds; that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that never saw a refusal before, latencies that blow through your downstream timeouts, customer-facing outputs that read like a different product. The primary came back ninety minutes later. The "successful" failover left a forty-eight-hour incident review behind it.

This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.