
Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.

The score floor is a stale gate

Score floors get written into a CI pipeline once, usually during the launch sprint of an AI feature, and they stop being recalibrated about ten minutes after that. The number that made sense when the model was at 0.88 keeps making sense when the model is at 0.94 — except it doesn't, because the 0.85 floor now permits a 9-point regression to land silently.

The deeper issue is what the team learns from the gate. A floor teaches "did this PR pass quality" as a yes/no question. A diff teaches "did this PR move quality, and in which direction, on which slice." The first framing optimizes for clearing a bar; the second optimizes for not making things worse. Those are different cultural defaults, and they show up in code review. Reviewers stop asking "did the score go down?" because the green check answered that for them — even though the green check only confirmed "still above 0.85."

You can sometimes patch a floor by raising it after each release, ratcheting toward the current score. That's a worse fix than it sounds. Now you've coupled the bar to the noise floor of last week's eval run, and the first PR that hits a small-N slice with high variance will fail spuriously, the team will lower the threshold to unblock, and you've taught everyone that the eval gate is a thing you bargain with rather than a thing you trust.

The diff is a smaller, more honest claim

The contract you actually want is narrower: "this PR does not regress on the slices that matter, beyond the noise floor." Not "this PR is high-quality." High-quality is a release-readiness question, not a per-PR question, and it's the wrong question to ask 40 times a week against the same golden set.

Concretely, the signal a baseline-aware eval reports is the per-slice score delta against the parent commit (or the main branch's last green commit), with the noise band shown alongside. A reviewer reading the PR sees "intent classification: −0.03 (within noise), citation accuracy: −0.08 (regression), refusal rate: +0.02 (improvement)." They can act on each line independently. The aggregate score may still be above the old floor, but the citation slice has fallen off a cliff and the gate catches it.
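A minimal sketch of that per-slice diff report, assuming the simplest possible shape: dicts of slice scores for the parent commit and the PR head. The slice names, scores, and noise bands are illustrative, not from a real suite.

```python
# Hypothetical per-slice scores for the parent commit and the PR head,
# plus each slice's pre-characterized noise band.
parent = {"intent_classification": 0.91, "citation_accuracy": 0.88, "refusal_rate": 0.95}
head   = {"intent_classification": 0.88, "citation_accuracy": 0.80, "refusal_rate": 0.97}
noise  = {"intent_classification": 0.04, "citation_accuracy": 0.03, "refusal_rate": 0.015}

def diff_report(parent, head, noise):
    """Classify each slice's delta vs. the parent commit against its noise band."""
    report = {}
    for slice_name, base in parent.items():
        delta = round(head[slice_name] - base, 3)
        band = noise[slice_name]
        if delta < -band:
            verdict = "regression"
        elif delta > band:
            verdict = "improvement"
        else:
            verdict = "within noise"
        report[slice_name] = (delta, verdict)
    return report
```

With these numbers, the report reproduces the example above: intent classification lands within noise, citation accuracy flags as a regression, refusal rate as an improvement, and a reviewer can act on each line independently.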

This framing also fixes the asymmetry in how floors treat improvements. A PR that lifts a struggling slice from 0.62 to 0.71 is a meaningful win, but a global floor doesn't see it — the aggregate barely moves. A diff view surfaces it as a clear delta on the slice that needed the help, and reviewers can give credit accordingly.

Per-slice non-regression beats a global aggregate

Aggregate eval scores hide the regressions you most want to see. A 2-point drop in a global average can mask a 15-point collapse in a single category whose volume is small but whose user-facing severity is high — the long-tail intent that fails ten times a day in a way that drives churn. The PR's score diff looks innocuous until you slice it.

The discipline is to define slices the way you define test files: explicitly, with stable membership over time, and with a clear ownership story. Useful slice axes include intent or task type, language or locale, input length bucket, customer tier, and "was this question previously a bug." The rule that should land in the gate is not "global average must not drop more than X" but "no slice's score may regress by more than its noise band." That rule fails fast on the categories where regressions matter most, and it doesn't ding a PR that's neutral on a slice it never touched.
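The gate rule reduces to a few lines. This is a sketch under the same assumed data shape as before (dicts of slice scores and per-slice noise bands, all numbers illustrative):

```python
def per_slice_gate(parent, head, noise_band):
    """Fail only the slices that regress beyond their own noise band.
    A PR that is neutral on a slice it never touched is not dinged."""
    failing = [
        s for s, base in parent.items()
        if head[s] - base < -noise_band[s]
    ]
    return (len(failing) == 0, failing)

# A 15-point collapse on a small slice blocks the merge even though the
# global average barely moves.
parent = {"global_avg": 0.90, "long_tail_intent": 0.85}
head   = {"global_avg": 0.89, "long_tail_intent": 0.70}
bands  = {"global_avg": 0.02, "long_tail_intent": 0.05}
ok, failing = per_slice_gate(parent, head, bands)
```

The global-average rule would wave this PR through; the per-slice rule names the slice that broke.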

Slice ownership matters because slices fragment. As soon as the eval suite has 20 slices, somebody starts asking why slice 14 is failing on every PR, and the answer is usually "slice 14 has 9 examples in it and any noise looks like a regression." That's a slice-design problem, not a gate problem — but the team will solve it by relaxing the gate, which destroys the signal everywhere else. The owner of a slice should also own its example count, its labeling quality, and its noise budget.

Flake budgets distinguish noise from signal

Eval suites are not deterministic. Sampling temperature, judge-model variance, and small-N slices all produce run-over-run variation that will trip a naive non-regression rule. A gate that fires on noise is a gate that gets disabled, so the noise budget has to be a first-class part of the design.

The cleanest approach is to characterize each slice's noise band before you ship the gate. Run the eval suite N times against the same prompt, compute the per-slice standard deviation, and treat any delta inside ±2σ (or whatever confidence level you want) as flake rather than signal. Slices with more examples will have tighter bands; slices with fewer will have wider ones, which is exactly the right behavior — a small slice telling you it might have regressed should require a bigger swing to be believed.

This buys you a separate insight for free: slices with absurd noise bands are slices that are too small to use for gating. Promote them to "informational only" until the example count comes up. The noise band tells you which slices are load-bearing in the gate and which are decorative.

A flake budget also clarifies what to do when a PR fails. If the regression is inside the noise band on a small slice, the right action is to re-run, not to investigate. If the regression is outside the band, or if it's inside the band but consistent across re-runs, the right action is to investigate. Neither of those is "lower the threshold."
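That triage policy is mechanical enough to encode. A sketch of the decision table, with hypothetical argument names:

```python
def triage(delta, band, consistent_across_reruns=False):
    """Next action for a slice's suspicious delta, per the flake budget."""
    if delta >= 0:
        return "no action"
    if -delta > band:
        return "investigate"   # outside the noise band: real signal
    if consistent_across_reruns:
        return "investigate"   # inside the band but reproducible
    return "re-run"            # likely flake on a small slice
```

Note that "lower the threshold" is not a reachable branch.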

The PR comment is the actual product

The diff and the slices and the noise budget all exist to feed one artifact: a PR comment that a reviewer can read in twenty seconds and act on. Coverage tools nailed this user experience — a Codecov comment shows the patch coverage, the file-level diff, and the change against the base, with the things that need attention pulled to the top.

An eval-diff PR comment should do the same shape of work. Lead with the slices that regressed beyond their noise band. Show the slices that improved beyond their noise band. Roll up the rest into "no significant change." Link each slice row to a sample of the regressing examples, because the question after "did it regress" is always "show me what broke." If the comment makes the reviewer click through three dashboards to answer that, it loses to "lgtm, the average is fine."

A few specific things make this comment trustworthy in practice. Comparing against the parent commit on the same branch, not against the latest main, prevents noise from unrelated merges leaking into the diff. Pinning the eval suite version, the judge model version, and the temperature in the comment header makes the result reproducible — and forces an obvious question when any of those changes. Including a small "re-run this slice" button (or its CLI equivalent) gives reviewers a path to handle suspected flake without disabling the gate. None of these are clever; they're the same affordances test-runner UIs converged on years ago.
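Putting the ordering and the pinned header together, a sketch of the comment renderer (the header keys and slice data are illustrative, not a real suite's schema):

```python
def render_comment(deltas, noise, header):
    """Render an eval-diff PR comment: provenance pinned in the header,
    regressions first, improvements next, everything else rolled up.
    `deltas` maps slice -> score delta vs. the parent commit."""
    regressions  = {s: d for s, d in deltas.items() if d < -noise[s]}
    improvements = {s: d for s, d in deltas.items() if d > noise[s]}
    quiet = len(deltas) - len(regressions) - len(improvements)

    lines = [f"eval-suite={header['suite']} judge={header['judge']} temp={header['temperature']}"]
    for s, d in sorted(regressions.items(), key=lambda kv: kv[1]):
        lines.append(f"REGRESSION  {s}: {d:+.2f} (band ±{noise[s]:.2f})")
    for s, d in sorted(improvements.items(), key=lambda kv: -kv[1]):
        lines.append(f"improvement {s}: {d:+.2f}")
    lines.append(f"{quiet} slice(s): no significant change")
    return "\n".join(lines)
```

Linking each regression line to a sample of its failing examples is the part this sketch omits, and it's the part that saves the reviewer the three-dashboard click-through.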

A migration that doesn't blow up the team

Most teams already have an absolute-threshold gate they can't just rip out, because the gate is wired into branch protection rules and several months of muscle memory. A safer migration runs both gates in parallel. Keep the floor as a soft alarm — a red status check that doesn't block — and introduce the diff as the new blocking gate. After a couple of weeks of data, you'll see whether the diff catches the things the floor missed (it does) and whether the floor catches anything the diff didn't (mostly: the cases where you needed to recalibrate the diff's noise bands).
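The parallel-run phase can be expressed as two status checks with different blocking behavior. A sketch with hypothetical check names, not tied to any particular CI system:

```python
def migration_checks(aggregate, floor, diff_failures):
    """Run both gates side by side: the old floor reports but never blocks;
    the per-slice diff is the only blocking check."""
    checks = {
        "eval-floor (soft alarm)": {"blocking": False, "passed": aggregate >= floor},
        "eval-diff":               {"blocking": True,  "passed": not diff_failures},
    }
    merge_allowed = all(c["passed"] for c in checks.values() if c["blocking"])
    return checks, merge_allowed

# The PR clears the stale floor, but a slice regressed: the floor stays
# green while the diff blocks the merge -- the case the floor always missed.
checks, ok = migration_checks(aggregate=0.87, floor=0.85,
                              diff_failures=["citation_accuracy"])
```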

While that runs, instrument the comment but don't enforce it. Let reviewers see the per-slice diff for two sprints before any check turns red on it. The instrumentation period is when you discover that three of your slices have unusable noise bands, that one of them is actually two slices in a trench coat, and that the judge model needs version-pinning before any of this works. None of those discoveries are graceful to make under a blocking gate.

The end state isn't "no thresholds anywhere." A few absolute thresholds still earn their keep — refusal rate, hallucination rate on a high-severity slice, latency p95. Those are user-facing contracts, not quality scores, and a floor is the right shape for them. Everything else — the aggregate quality scores, the per-task accuracy numbers, the rubric averages — wants a diff.

What to expect after the switch

Three things change once a team is shipping diffs instead of floors. First, PRs become more cautious in the right way: an author who would have batched five prompt edits into one PR now splits them, because the diff makes it obvious which edit moved which slice. Second, slice ownership starts mattering. Someone has to be responsible for the 9-example slice that keeps tripping the noise budget — either by growing it, retiring it, or accepting it as informational. Third, the conversation about quality moves out of release-readiness review and into per-PR review, which is where it should have been.

What doesn't change is that the eval suite is still load-bearing infrastructure that needs investment. Diffs don't fix bad slices, missing labels, or judge-model drift. They surface those problems faster, but they don't substitute for the work of curating the suite. A team that wires up a beautiful diff comment on top of a stale, noisy, unrepresentative golden set will get a faster, more honest signal that their eval suite is broken — which is a good signal to have, just maybe not the one they were hoping for.

The deeper realization is that prompt and model PRs deserve the same regression-detector treatment that test suites get on every other change. The eval suite isn't a quality bar; it's a regression detector. Calibrating it against last quarter's snapshot is the bug. Calibrating it against the parent commit, slice by slice, with a noise budget that respects small-N reality, is the fix — and it turns the eval gate from a thing the team learns to clear into a thing the team learns to trust.
