Skip to main content

168 posts tagged with "evaluation"

View all tags

The LLM-as-Judge Ensemble That Agreed Because All Judges Were the Same Family

· 10 min read
Tian Pan
Software Engineer

Your evaluation pipeline runs a three-judge ensemble against every model output. The judges are GPT-4 with a strict rubric, GPT-4 with a permissive rubric, and GPT-4 with a chain-of-thought rubric. They agree on 91% of cases. You report inter-judge agreement of 0.83 Krippendorff's alpha to the launch review committee. The number lands in the "substantial agreement" band that every methodology textbook treats as a green light. Three model upgrades ship against that number over six months.

An external auditor swaps one of the three judges for Claude using the same rubric and the agreement rate on hard cases drops to 64%. The eval score that justified the last three upgrades turns out to be a number that depends on which provider family you treat as ground truth. The upgrades were upgrades against GPT-4 family preferences, not against quality — because the judges were the model being judged's siblings.

The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number

· 11 min read
Tian Pan
Software Engineer

Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.

This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.

The OpenTelemetry Tail Sampler That Dropped Exactly the LLM Spans Your Post-Mortem Needed

· 11 min read
Tian Pan
Software Engineer

A user pings support: "the assistant told me to cancel my service to update my address, that's insane." Your team opens the incident, asks for the conversation ID, drops it into the tracing UI, and gets a polite "no spans found for this trace." The 24-hour retention window closed an hour ago. The tail sampler decided this conversation was a routine success because the response was a syntactically valid JSON object, returned with a 200, in 1.4 seconds. By every signal your collector understood, nothing happened.

The model returned a sentence that destroyed a customer relationship, and your observability pipeline classified it as uneventful. This is not a bug in the sampler. The sampler did exactly what you configured it to do. The problem is that the policy you wrote was designed for a request-response world where "success" and "worth keeping" were close enough to be the same thing, and you ported it unmodified into a system where they are not.

The Supervisor Agent That Rubber-Stamped Its Subagent Because They Shared a Prompt Template

· 9 min read
Tian Pan
Software Engineer

A team I talked to last month was proud of a number: their supervisor agent approved 97% of its subagents' plans on first review. They read that as "the subagents are competent." A red-team review six weeks later read it as "the supervisor and the subagents are the same evaluator scoring its own output." Both readings fit the data. Only one of them was load-bearing in production.

The supervisor-reviews-subagent pattern is the most common shape multi-agent systems take in 2026 — somewhere around 70% of production deployments, including most of the reference designs the big labs publish. It looks like a check on paper. A planner decomposes the task, specialist workers produce plans, a supervisor reviews each plan before authorizing execution. Separation of concerns, clean audit trail, the works. The problem is that if you build the supervisor and the subagents from the same base prompt template — even with role-specific addenda differing by a paragraph — you have not built a check. You have built a system whose review step is an artifact of the same model agreeing with itself.

Your Eval Suite Is a Production Workload: When Nightly Tests Starve Live Traffic

· 11 min read
Tian Pan
Software Engineer

A team's most successful AI feature went dark at 2:14 AM on a Tuesday. The pager said the model API was returning 429s in steady state. The model was healthy. The provider was healthy. The team's own production traffic was nominal. What was eating the quota was the nightly eval suite — the same suite the team had been proudly expanding the previous week. The eval and the product shared an organization key, and on that night the eval was the noisy neighbor that broke its own roommate.

The eval wasn't misbehaving. It was doing exactly what its authors designed: a thousand cases against the production model identifier, on a cadence, on a schedule everyone had forgotten about because it had been quiet for two years. The expansion that finally pushed it over the limit added three hundred cases. The PR was reviewed by the eval owner and the prompt owner. Nobody on the review thread thought to ask: how much of the daily token quota does this consume?

The Agent's I-Don't-Know Rate That Fell After You Added More Tools

· 9 min read
Tian Pan
Software Engineer

You added the search tool, then the calendar tool, then the CRM tool, then four database wrappers and a calculator. The dashboard moved the way you wanted: task-completion ticked up, latency held, the "I don't know" rate dropped from 14% to 4%. Looks like a capability win. It is not. The planner did not learn more; it learned less abstention. Every question now looks answerable because there is always some tool that pattern-matches the query well enough to call. The 10 percentage points of "I don't know" you removed did not turn into correct answers — they turned into confident wrong ones, distributed across the long tail where nobody is grading carefully.

This is the false-competence trap of tool surface expansion. It is the most common way a team ships a regression while celebrating an improvement. The eval rubric measures whether the agent attempted the task and produced a plausible-shaped answer; it does not measure whether the agent should have refused. Abstention is not free, but it is the cheapest correct behavior available, and you stop being able to see it the moment your tool palette gets large enough that something always fires.

The Fine-Tune That Overfit to Your Eval Rubric and Graded Itself a Winner

· 10 min read
Tian Pan
Software Engineer

The fine-tune ships, the eval dashboard goes green, and the team sends the celebratory screenshot. A week into production, the support backlog is shaped exactly like it was before the training run. The model that scored 87 on your rubric is doing the same job, badly, that the pre-fine-tune model did at 71. Nothing leaked from your test set. The data was clean. The split was honest. What broke is more subtle: the rubric that scored the training reward is the same rubric that scored the eval, and the model learned the rubric.

This is the failure mode where a green dashboard certifies memorization rather than capability. The training loop pushed the model toward whatever the rubric rewarded, the rubric had a surface — a shape, a phrasing, a set of cues a judge model latches onto — and the model learned that surface faster than it learned the underlying behavior. By the time you evaluate against the same rubric, you are no longer measuring whether the model got better. You are measuring whether it found the rubric's tells.

The Latency-Budget Router That Was a Quality-Loss Router by Another Name

· 10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.

The Near-Duplicate Filter That Took Your Only Hard Example With It

· 10 min read
Tian Pan
Software Engineer

Your dedup step reported a corpus shrink of 28% and the training run finished six hours faster. The eval numbers came in flat-to-slightly-better. Nobody opened the diff of what got removed. Three weeks later support starts paging about a class of refund-reversal tickets the model used to handle and now flatly mishandles. There are eleven training rows that touched that exact pattern. Nine of them are gone — collapsed into a single representative that kept the shortest, cleanest phrasing and dropped the messy hostile-tone variants where the model had actually learned to de-escalate. Your dedup pipeline did that, and your evals did not catch it, because by the time the eval set was built, those examples were already gone from the train set the eval was sampled from.

This is the failure mode that bothers me about deduplication as a pipeline step: it presents itself as hygiene and it is actually distribution editing. Removing exact duplicates of boilerplate is hygiene. Removing near-duplicates by a similarity threshold is a sampling decision dressed up as one. The threshold picks which slices of your training distribution survive, and the slices most likely to lose are the ones where you have the fewest examples to begin with — which are also, almost by definition, the ones you were keeping for coverage rather than count.

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.