When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other
It's Tuesday morning, the week after a prompt change shipped to half your traffic. You open four dashboards. The held-out golden set scored by the LLM judge says +8%. The human-rater panel that samples production weekly says no change. The A/B test on downstream conversion says −2%. The thumbs-up rate is flat. Four signals, four verdicts, and a standup in fifteen minutes where someone is going to ask whether you ship the prompt or roll it back.
The temptation is to pick the number that confirms what you already wanted to do — and the team will, because nobody on the call has a written rule for which signal wins. The disagreement isn't a measurement bug. It's the predictable output of a system that bolted four evaluators together without a hierarchy, and the cost of not having one is that every release week becomes a debate about whose number to trust.
This is not a rare situation. It's the steady state of any LLM product with more than one evaluation surface. The signals are designed to disagree — they answer different questions on different cadences against different populations. Treating them as interchangeable is what produces the standoff. The fix is not better evals; it's an explicit signal hierarchy, written down before the next disagreement, with rules for when each signal is allowed to overrule the others.
Why the four signals point in four directions
Each evaluator answers a different question against a different ground truth. The disagreement is information, not noise — once you know what each one is actually measuring.
The LLM-judge golden set measures whether your prompt change improved performance on a frozen distribution of inputs your team hand-curated. It's cheap, fast, and runs on every PR. It's also biased toward the kinds of cases your team knows to write, scored by a model whose calibration drifts independently from your product. A +8% lift here means "the judge thinks this version handled our test cases better today than yesterday's version handled them yesterday." That's a real signal. It's just narrower than people read it as.
The human-rater panel measures whether trained humans, scoring a weekly sample of production traffic against a rubric, see a difference. Humans are slow, expensive, and high-variance — but they're the calibration anchor for everything else. When humans say "no change" and the judge says "+8%," the most likely explanation is that the judge has shifted under your feet (a model upgrade, a prompt format change, a new few-shot example) without the humans agreeing the world has shifted with it.
The A/B test on conversion measures whether the prompt change moved a downstream business metric on real users in real sessions. It's the only signal that proves anything about value delivered. It's also the noisiest: conversion has dozens of upstream causes, the test usually runs underpowered for the effect size, and the metric responds on a delay measured in days. A −2% read in week one is closer to a coin flip than a verdict, but teams routinely treat it as ground truth because it's the one number leadership reads.
The thumbs-up rate measures whether users who chose to give explicit feedback rated the response positively. Three things to know about it: only 1–5% of sessions ever get rated, the sample skews toward complaints (negative-feedback bias), and that asymmetry makes the thumbs-up rate mostly a measure of whether you're failing badly, not of whether you're succeeding more. A flat thumbs-up rate is consistent with either "no change" or "moved within the noise floor of a heavily biased sample." It is rarely the signal that should drive a ship/no-ship call.
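To make the noise-floor point concrete, here is a back-of-envelope sketch; the traffic volume, rating rate, and baseline thumbs-up share are assumed numbers, not measurements from any real product.

```python
# Back-of-envelope noise floor for a thumbs-up rate, using a normal
# approximation to the binomial. All numbers here are assumptions.
import math

sessions_per_week = 200_000      # assumed weekly traffic on the treated arm
rating_rate = 0.02               # assumed 2% of sessions leave explicit feedback
baseline_thumbs_up = 0.80        # assumed share of ratings that are thumbs-up

n_rated = sessions_per_week * rating_rate        # ~4,000 rated sessions
se = math.sqrt(baseline_thumbs_up * (1 - baseline_thumbs_up) / n_rated)
ci_95_half_width = 1.96 * se

print(f"rated sessions: {n_rated:.0f}")
print(f"95% CI half-width on thumbs-up rate: +/-{ci_95_half_width:.3f}")
# Roughly +/-0.012 on a 0.80 baseline: a "flat" reading is consistent with
# anything from about -1.2 to +1.2 points of true movement, before even
# accounting for who chooses to rate at all.
```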
The hierarchy: which signal wins under which conditions
A signal hierarchy is not "rank the four metrics from best to worst." It's a documented map of which signal wins under which conditions, with explicit rules about when each is allowed to overrule the others. The mapping that holds up in practice looks roughly like this:
- Judge for fast iteration. It's the inner loop. Use it to decide which prompt variants are worth running a more expensive eval on. Never use it to override a human or A/B signal in the opposite direction — the judge can be miscalibrated, and a +8% on the judge against a flat human panel is the most common signature of judge drift, not improvement.
- Humans for calibration. The human panel is your ground truth for what good output looks like. It's how you detect that the judge has drifted, the rubric has aged out, or the task has shifted. It is not your ground truth for what users want — humans rate in a controlled context that real sessions don't share.
- A/B for ground truth on impact. The A/B test is the only signal that proves the change moved a metric tied to user value. It overrules the judge and the panel when adequately powered, the cohort matches the change, the metric was pre-registered, and the run length matches the metric's response time. When any of those is missing, the A/B is a directional signal at best, not ground truth.
- Telemetry (thumbs-up, latency, error rate) for monitoring. Production telemetry surfaces drift, abuse, cost spikes, and outright failures faster than any other signal. It is the smoke detector, not the verdict. A flat thumbs-up rate is consistent with success and failure both; a 30% drop in thumbs-up rate means stop the release immediately.
The rule that follows from this: a positive judge signal does not ship the change when the A/B test is negative and adequately powered. A negative A/B read does not force a rollback when the human panel says quality improved and the metric's movement is explained by a known confound (seasonality, a traffic-mix shift, an upstream change in another team's surface). A flat thumbs-up rate never ships or rolls back anything by itself.
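Written down, the hierarchy can be as literal as a decision function. The sketch below is one way it might look; the field names, thresholds, and decision labels are illustrative assumptions, not a standard.

```python
# A minimal sketch of the signal hierarchy as an explicit decision rule.
# Field names, thresholds, and Decision labels are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    SHIP = "ship"
    HOLD = "hold"
    ROLL_BACK = "roll_back"

@dataclass
class Signals:
    judge_delta: float            # relative lift on the golden set
    human_delta: float            # relative lift on the human-rater panel
    ab_delta: float               # relative lift on the pre-registered A/B metric
    ab_adequately_powered: bool   # power met the target for the pre-registered MDE
    thumbs_up_drop: float         # relative drop in thumbs-up rate vs. baseline
    known_confound: bool          # e.g. seasonality or an upstream change

def decide(s: Signals) -> Decision:
    # Telemetry is the smoke detector: a large drop halts the release outright.
    if s.thumbs_up_drop >= 0.30:
        return Decision.ROLL_BACK
    # An adequately powered negative A/B overrules a positive judge signal,
    # unless the human panel improved and a known confound explains the dip.
    if s.ab_adequately_powered and s.ab_delta < 0:
        if s.human_delta > 0 and s.known_confound:
            return Decision.HOLD
        return Decision.ROLL_BACK
    # A positive judge signal ships only when neither the human panel
    # nor the A/B contradicts it.
    if s.judge_delta > 0 and s.human_delta >= 0 and (
        not s.ab_adequately_powered or s.ab_delta >= 0
    ):
        return Decision.SHIP
    return Decision.HOLD
```

The point of encoding it this literally is not automation; it is that a function with branches forces the team to state, in advance, which branch wins.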
Write this down. The team that hasn't written it down will relitigate it every week, and the team member with the most political capital will win — which is not the same as the right answer winning.
The calibration anchor: detecting judge drift before it costs you a release
The judge is the cheapest signal and the one you'll lean on hardest. It's also the one that drifts most silently. When the judge drifts upward — scoring everything more leniently — every prompt variant looks like an improvement, the team ships an "improvement" weekly, and only the human panel ever notices that the actual quality bar has stayed flat or fallen.
The defense is a frozen calibration slice of the eval set, scored by humans once, then re-scored by the judge on every model upgrade, framework upgrade, or judge-prompt change. The judge's score on this frozen slice is not allowed to be the variable you're optimizing. It's the variable that proves the judge is still measuring what it measured last quarter. If the judge's score on the frozen slice moves while the human scores stay fixed, the judge has drifted, and every comparative metric that depends on it is contaminated until you re-anchor.
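A minimal sketch of that drift check, assuming each frozen-slice example stores its fixed human score and the judge score recorded when the slice was last anchored; the 0.05 tolerance is an illustrative threshold, not a recommendation.

```python
# Frozen-slice drift check: compare the judge's fresh scores against the
# scores it produced at the last human anchoring. Field names are assumptions.
from statistics import mean

def judge_drift(frozen_slice: list[dict], new_judge_scores: list[float],
                tolerance: float = 0.05) -> dict:
    """Flag judge drift on a frozen, human-anchored calibration slice."""
    anchored = [ex["judge_score_at_anchor"] for ex in frozen_slice]
    human = [ex["human_score"] for ex in frozen_slice]

    delta_vs_anchor = mean(new_judge_scores) - mean(anchored)
    delta_vs_human = mean(new_judge_scores) - mean(human)

    return {
        "judge_delta_vs_anchor": delta_vs_anchor,   # movement of the judge itself
        "judge_delta_vs_human": delta_vs_human,     # gap to the fixed human scores
        "drifted": abs(delta_vs_anchor) > tolerance,
    }
```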
A few practical patterns make this work:
- Hold one slice of the eval set frozen for at least two quarters. Don't add to it, don't tune against it, don't let it ride forward with the judge.
- Re-score the frozen slice with the judge every time you change the judge model, the judge prompt, the rubric, or the few-shot examples. Track the delta as a first-class metric next to your headline eval scores.
- Periodically (monthly is a good cadence) sample 5–10% of judge verdicts on live traffic and have a human re-grade them (see the agreement sketch after this list). Track human-judge agreement over time. A downward trend means the judge is broken, the task has shifted, or the rubric has aged out; all three require intervention, and none of them is "ship the next prompt."
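The agreement check can stay small. A sketch, assuming categorical verdicts (e.g. "pass"/"fail") and using scikit-learn's cohen_kappa_score to correct raw agreement for chance:

```python
# Monthly human-judge agreement check on a reproducible sample of live traffic.
import random
from sklearn.metrics import cohen_kappa_score

def sample_for_regrade(judged_records: list[dict], fraction: float = 0.05,
                       seed: int = 0) -> list[dict]:
    """Pick a reproducible 5-10% slice of judged production records."""
    rng = random.Random(seed)
    k = max(1, int(len(judged_records) * fraction))
    return rng.sample(judged_records, k)

def agreement_metrics(judge_labels: list[str], human_labels: list[str]) -> dict:
    raw = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)
    kappa = cohen_kappa_score(judge_labels, human_labels)  # chance-corrected
    return {"raw_agreement": raw, "cohen_kappa": kappa}
```

Plot both numbers over months; a kappa that slides while raw agreement looks stable is an early sign the judge is drifting toward the majority class.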
The overhead of this discipline is small. The team that skips it spends a quarter chasing a phantom regression that turns out to be a judge that started rating outputs more harshly after a vendor upgraded the underlying model.
A/B test design as a precondition, not a postscript
The most common reason A/B tests "disagree with the judge" is that the A/B test was set up after the judge said the prompt looked good, and the design didn't survive contact with the data. The test ran on a cohort the prompt change wasn't expected to affect. The metric was picked because it was easy to query, not because it was sensitive to the change. The minimum-detectable-effect was never computed, so the test was underpowered for the actual lift you'd plausibly see. The win condition was decided after the results came in.
A pre-registered A/B test design is what turns the A/B into a reconcilable signal rather than a debate prop. Three minimum requirements:
- Pre-register the metric, the cohort, and the minimum-detectable-effect before the test runs. Write these in a doc, share it, and don't let the team change them once results start landing. The cohort should match the population the prompt change targets — running an A/B against your full traffic when the change only affects a 5% intent slice will dilute the effect into invisibility.
- Power the test for the effect size you actually expect, not the effect size you'd be excited to find. A judge-set lift of +8% rarely translates to anything close to that on a downstream business metric; if your test is powered to detect a 5% conversion lift, it will read flat or noisy on a change that produces a real but smaller lift. Underpowered tests are not "inconclusive"; they are loud noise that drowns the signal (a sketch of the sample-size arithmetic follows this list).
- Match the run length to the metric's response time. Conversion metrics that complete on a multi-day delay can't be read in three days. Calling an A/B at day three because the team is impatient is the same mistake as calling an election with 2% of precincts reporting.
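For the power requirement in the second item, the arithmetic is a few lines with statsmodels; the baseline conversion rate and expected lift below are assumptions to replace with your own numbers before the test runs.

```python
# Sample size per arm needed to detect a small relative lift in a conversion
# rate. Baseline and lift are assumed numbers for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_conversion = 0.10       # assumed control conversion rate
relative_lift = 0.02             # the 2% relative lift you actually expect
treated_conversion = baseline_conversion * (1 + relative_lift)

effect_size = proportion_effectsize(treated_conversion, baseline_conversion)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"sessions needed per arm: {n_per_arm:,.0f}")
# A 2% relative lift on a 10% baseline needs well over 100,000 sessions per
# arm at 80% power; if your weekly eligible traffic is a tenth of that, the
# test will read as noise no matter what the prompt change did.
```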
Once the design discipline is in place, A/B disagreements with the judge become diagnostic rather than unresolvable. When the judge says "+8%" and an adequately powered A/B says "−2%", that tells you something specific: the prompt change improved the surface the judge was scoring (probably the cases your team curated) and degraded a surface the A/B was sensitive to (probably the long tail of intents your golden set under-represents). That's a useful finding, and one the team can act on instead of arguing about whose number to believe.
The org failure mode: nobody owns the reconciliation
The deeper failure isn't that the signals disagree. It's that no one role is accountable for resolving the disagreement. The team that owns the prompt cites the judge. The team that owns the product cites the A/B. The team that owns retention cites the telemetry. The team that owns quality cites the human panel. Each is honest. Each is selecting the signal that flatters their priors. And the meeting ends with a compromise that satisfies no one and ships nothing.
The reconciliation has to be owned by a named role with authority to call the shot — typically the AI platform lead or a "head of evals" function — backed by a written escalation path when signals contradict. The signal hierarchy is the contract that role enforces. The role's job is not to relitigate the hierarchy weekly but to apply it consistently and to flag the cases where the hierarchy didn't fit and needs revision. Without that role, the hierarchy becomes a doc nobody reads.
A few patterns reduce the political surface area:
- The hierarchy is approved by all four owning teams before any disagreement happens. The week of the disagreement is the worst time to negotiate which signal wins.
- Disagreements that the hierarchy doesn't resolve get escalated to a structured post-mortem, not a Slack debate. The output is a hierarchy revision, not a one-time decision.
- The signal hierarchy itself is reviewed quarterly. The world changes — a new model class, a new eval methodology, a new product surface — and a hierarchy that was right last quarter may not be right this quarter.
The eval system as a maintained property
The takeaway practitioners learn the hard way is that a multi-signal eval system is not a property of the model or the prompt. It is a maintained property of the evaluation infrastructure, and it requires the same discipline you apply to any other production system: explicit contracts, written-down hierarchies, calibration anchors, drift detection, owned escalation paths, and a periodic review cadence.
The team without an explicit signal hierarchy will spend every release week relitigating which number to trust, and the cost shows up in the calendar before it shows up in the metrics: senior engineers arguing instead of shipping, prompt changes stalled in review for days because no one will commit, A/B tests stopped early because patience ran out before significance arrived. The fix is unglamorous (a doc, a frozen slice, a pre-registered test, a named owner), but it is the difference between an eval system that informs decisions and an eval system that produces enough numbers to support whichever decision the room had already made.
Write the hierarchy down before the next disagreement. The next disagreement is on the calendar; you just don't know its date yet.
