
Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.

The pattern to copy is already in your repo. Code coverage made the same journey twenty years ago: from a nightly job whose HTML report nobody opened, to a per-PR delta posted as a sticky comment with file-by-file callouts. Most engineers cannot remember a time when coverage wasn't inline, and that's the point — once the feedback was adjacent to the change, the practice survived without anyone needing to be sold on it.

What "We Have Evals" Actually Means in Practice

Walk into any team that has been doing LLM work for more than a quarter and ask where the evals run. The answer is almost always the same: there's a script, it runs against a "golden dataset," and the results land in a notebook, a spreadsheet, or — if the team is feeling mature — a hosted dashboard. The script gets invoked manually before a release, or on a cron, or whenever a PM asks "did we improve the summarization?"

This setup makes evaluation a job — a thing you trigger, watch run, and interpret. Jobs have inertia. A job that takes ten minutes will not be triggered on every commit; a job that produces a hundred-row table will not be read on every commit; a job that lives outside the code review tool will not block a merge. The team has built a measuring instrument and then placed it three rooms away from the experiment.

The natural drift is predictable. The eval gets run on the days it gets run. PRs that touched a prompt get merged based on whether the engineer "felt good about the change" and whether reviewers could spot anything obviously wrong by reading the new template. When the eval does run, it often shows a two- or three-point regression on a slice nobody was watching, and the team doesn't connect the regression to the merge that caused it because the connection isn't visible in the tool where merges happen.

This is not a tooling failure of the eval framework. The frameworks are fine — DeepEval, promptfoo, Braintrust, Latitude, Traceloop, and a dozen others can all produce competent metrics on a golden dataset. The failure is that the metrics are produced somewhere other than the surface where the merge decision is made.

What Code Coverage Got Right

Code coverage as a metric has always been weak. Hitting a line of code is not the same as testing it; 100% coverage on a function that returns the wrong type is still 100% coverage; teams that chase the number write tests that exercise lines without asserting on behavior. Despite all of that, coverage as a practice survived because of one specific thing: the report moved into the PR comment.

The progression matters. Early coverage tools produced HTML reports — a directory of files you opened in a browser, color-coded line by line. They were genuinely useful and almost nobody used them. The next step was the dashboard, hosted somewhere with a URL that decayed in the team wiki. Slightly more useful, still not load-bearing on the merge decision. The thing that made coverage stick was a sticky bot comment on every PR with three pieces of information: the overall coverage percentage, the delta versus the base branch, and a list of changed files where coverage dropped.

The third piece is the one that did the work. Knowing that overall coverage went from 87.2% to 87.1% is a fact you can rationalize away. Knowing that payments/refund.ts went from 94% to 71% on this PR specifically is a thing your reviewer is going to ask you about. The diff and the metric are co-located, and the metric is scoped to the change, not aggregated over the whole codebase.

Modern coverage actions tighten this further: they re-edit the same comment when you push new commits instead of stacking sticky notes, they show only files modified in the PR to keep the noise down, and they fail the check if the delta on changed files crosses a threshold. None of this is interesting individually. Together, they make coverage a thing that happens to your code review without anyone having to ask for it.

The Engineering Investment Behind a Useful PR Comment

Replicating that posture for LLM evals takes more than wiring up promptfoo eval or braintrust eval-action in a GitHub workflow. The wiring is the easy part — both ship an action that posts a comment. The hard part is making the comment carry information that survives a busy reviewer. Four pieces have to land.

Per-PR delta, scoped to what changed. Posting "factuality: 0.87" on every PR is a coverage-percentage-on-the-overall-codebase mistake. The reviewer needs to see the delta versus the base branch, on the slice of the eval suite that the change actually touches. If the PR modified prompts/extraction/invoice.md, the comment should report numbers for the invoice-extraction subset of the eval suite, not the global average where invoice-extraction is 4% of the dataset. Aggregate metrics hide segment regressions: a model swap can hold the global score steady while a 15% slice tanks by 30 points, and your reviewer will never see it because the global number didn't move.
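A minimal sketch of that scoping step, assuming a checked-in mapping from prompt paths to eval slices; the mapping, score dictionaries, and paths here are illustrative, not any particular framework's API:

```python
# Sketch: scope the eval delta to the slices a PR actually touches.
# Checked-in mapping from prompt paths to the eval slices they affect.
SLICE_MAP = {
    "prompts/extraction/invoice.md": ["invoice-extraction"],
    "prompts/support/summarize.md": ["support-summaries", "ticket-triage"],
}

def affected_slices(changed_files):
    slices = set()
    for path in changed_files:
        slices.update(SLICE_MAP.get(path, []))
    return sorted(slices)

def slice_deltas(base_scores, pr_scores, changed_files):
    """Return per-slice deltas for the slices this PR touches."""
    rows = []
    for name in affected_slices(changed_files):
        base, new = base_scores[name], pr_scores[name]
        rows.append({"slice": name, "base": base, "pr": new, "delta": round(new - base, 3)})
    return rows

# Example: the global average would hide this 0.06 drop on one slice.
base = {"invoice-extraction": 0.91, "support-summaries": 0.88}
pr = {"invoice-extraction": 0.85, "support-summaries": 0.88}
print(slice_deltas(base, pr, ["prompts/extraction/invoice.md"]))
# [{'slice': 'invoice-extraction', 'base': 0.91, 'pr': 0.85, 'delta': -0.06}]
```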

Blast-radius callout — which downstream consumers does this touch. Prompts compose. A change to a "summarize" prompt that's used by three product surfaces is a different change than one used by twelve. The PR comment should list the consumers — agents, chains, downstream prompts — that import or invoke the modified template, and run their evals too. Without this, the change to the "summarize" prompt looks small in the diff and turns out to have been the upstream of every eval regression for the next sprint. Static analysis on a prompt monorepo, plus a registry of who-calls-whom, makes this tractable.
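A sketch of what the registry lookup might look like; the registry contents are hypothetical, and a real one would be generated by static analysis of the prompt monorepo rather than maintained by hand:

```python
# Sketch: a who-calls-whom registry for prompts, so the PR comment can list
# downstream consumers and queue their evals alongside the prompt's own slice.
CONSUMERS = {
    "prompts/summarize.md": [
        "chains/support-triage",
        "chains/weekly-digest",
        "agents/inbox-assistant",
    ],
}

def blast_radius(changed_prompts):
    """Map each changed prompt to the chains and agents that invoke it."""
    return {p: CONSUMERS.get(p, []) for p in changed_prompts}

for prompt, consumers in blast_radius(["prompts/summarize.md"]).items():
    print(f"{prompt} is invoked by {len(consumers)} consumers: {', '.join(consumers)}")
    # ...each consumer's eval slice gets queued in the same PR run.
```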

Inline regression highlights, not just summary numbers. The most useful artifact a coverage tool emits is the list of changed lines that aren't covered. The eval-comment equivalent is the list of golden-dataset examples that newly fail (or newly pass): three to five rows showing input, expected output, previous output, new output, and the judge's verdict. A reviewer who sees "this PR causes example #142 to start failing — input was a refund request with embedded HTML, the new prompt drops the HTML and the extractor returns nothing" will catch the bug. A reviewer who sees only "factuality: 0.87 → 0.84" will rationalize that scores fluctuate.
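A sketch of how those rows could be selected, assuming each eval run produces per-example records with a pass/fail verdict; the record shape is illustrative:

```python
# Sketch: pick the handful of golden-set examples that newly fail on this PR,
# to render as regression highlights in the comment.
def newly_failing(base_results, pr_results, limit=5):
    """Examples that passed on the base branch but fail on the PR branch."""
    rows = []
    for example_id, pr_row in pr_results.items():
        base_row = base_results.get(example_id)
        if base_row and base_row["passed"] and not pr_row["passed"]:
            rows.append({
                "id": example_id,
                "input": pr_row["input"],
                "expected": pr_row["expected"],
                "previous": base_row["output"],
                "new": pr_row["output"],
                "verdict": pr_row["verdict"],
            })
    return rows[:limit]
```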

A latency budget that doesn't break the loop. A code coverage report runs in seconds because the tests already had to run. An LLM eval against a thousand-example golden set behind a remote judge can take twenty minutes and cost real money. Both kill the loop in different ways: twenty minutes is long enough that the reviewer moves on and never comes back to read the comment, and a real-money cost per PR makes the team disable the gate the first time finance asks. The fix is layered evaluation — fast deterministic checks (schema validation, regex, length budgets) on every PR, heuristic scoring on every PR with caching against unchanged inputs, and LLM-as-judge only on the affected slice rather than the full suite. The full eval still runs nightly; the PR sees the slice.
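A sketch of the layered gate under those assumptions; the function names, cache, and check choices are illustrative rather than any framework's API:

```python
# Sketch: deterministic checks on every example, heuristic scores cached
# against unchanged inputs, and the LLM judge only on the affected slice.
import hashlib
import json

_heuristic_cache: dict[str, float] = {}

def deterministic_ok(output: str, max_len: int = 2000) -> bool:
    # Cheap, always-on checks: here, valid JSON and a length budget.
    try:
        json.loads(output)
    except ValueError:
        return False
    return len(output) <= max_len

def heuristic_score(prompt_version: str, example: dict, score_fn) -> float:
    # Cache keyed on prompt version + example, so unchanged inputs are never re-scored.
    key = hashlib.sha256(
        json.dumps([prompt_version, example], sort_keys=True).encode()
    ).hexdigest()
    if key not in _heuristic_cache:
        _heuristic_cache[key] = score_fn(example)
    return _heuristic_cache[key]

def pr_eval(examples: list[dict], affected_slice: set[str], judge_fn, score_fn, version: str):
    results = []
    for ex in examples:
        row = {"id": ex["id"], "deterministic": deterministic_ok(ex["output"])}
        row["heuristic"] = heuristic_score(version, ex, score_fn)
        # The expensive judge call only runs for examples in the affected slice;
        # the full-suite judge pass stays on the nightly schedule.
        row["judge"] = judge_fn(ex) if ex["slice"] in affected_slice else None
        results.append(row)
    return results
```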

Failure Modes That Hide Behind a Green Checkmark

A handful of failure modes recur across teams that have wired up an eval action but haven't done the work above.

The most common is the aggregate green check. The PR comment shows "factuality: 0.87, all checks passed," merges happen, and a slice-level regression ships. This is the segmentation problem. The fix is per-segment thresholds with explicit segment definitions checked into the repo, so the action fails when any defined slice degrades by more than the slice-specific threshold rather than when the global mean does.
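One way to express that, sketched with illustrative slice names and budgets; in practice the thresholds would live in a checked-in config file next to the eval suite:

```python
# Sketch: per-segment thresholds, so the gate fails on any slice that
# degrades past its own budget rather than on the global mean.
SEGMENT_THRESHOLDS = {
    "invoice-extraction": -0.02,   # max allowed delta vs. base branch
    "urgent-ticket":      -0.01,   # tighter budget for a high-stakes slice
    "default":            -0.03,
}

def gate(slice_deltas: dict[str, float]) -> list[str]:
    """Return the slices whose regression exceeds their budget."""
    failures = []
    for name, delta in slice_deltas.items():
        budget = SEGMENT_THRESHOLDS.get(name, SEGMENT_THRESHOLDS["default"])
        if delta < budget:
            failures.append(f"{name}: {delta:+.3f} (budget {budget:+.3f})")
    return failures

print(gate({"invoice-extraction": -0.06, "support-summaries": 0.01}))
# ['invoice-extraction: -0.060 (budget -0.020)']
```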

The second is judge drift. LLM-as-judge metrics depend on a judge model whose behavior is itself a moving target. A judge swap or version bump silently re-baselines every score in the suite, and the next PR shows a "regression" that's actually a judge change. The mitigation is pinning the judge model and snapshotting baseline scores per release, so a delta against the base branch is comparing against scores produced by the same judge.
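A sketch of that pinning, with hypothetical model names and record shapes:

```python
# Sketch: pin the judge and key baselines by judge identity, so a delta is
# only ever computed against scores produced by the same judge.
JUDGE = {"model": "judge-model-2024-06", "prompt_version": "judge-v3"}

def baseline_key(release: str) -> str:
    return f"{release}/{JUDGE['model']}/{JUDGE['prompt_version']}"

def delta_against_baseline(baselines: dict, release: str, slice_name: str, pr_score: float):
    base = baselines.get(baseline_key(release), {}).get(slice_name)
    if base is None:
        # No baseline from this judge yet: report the raw score, skip the delta,
        # and re-baseline rather than flagging a phantom regression.
        return None
    return pr_score - base
```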

The third is the "cost regressed but quality didn't" gap. The comment shows quality numbers and nothing else, the PR moves traffic to a smaller model, quality holds, and the cost ledger doesn't notice that token usage tripled because the new prompt is more verbose. The fix is putting cost and latency in the comment as first-class columns alongside quality, with their own deltas and thresholds. A merge gate that checks performance budgets, not just quality, catches the case where the model is cheaper per token but the new template makes every input longer.
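A sketch of such a gate, with illustrative budgets and record fields:

```python
# Sketch: cost and latency carried as first-class columns with their own
# budgets, alongside quality.
BUDGETS = {"quality_delta": -0.02, "cost_delta_pct": 0.10, "p95_latency_delta_ms": 200}

def performance_gate(base: dict, pr: dict) -> list[str]:
    failures = []
    if pr["quality"] - base["quality"] < BUDGETS["quality_delta"]:
        failures.append("quality")
    if (pr["cost_per_run"] - base["cost_per_run"]) / base["cost_per_run"] > BUDGETS["cost_delta_pct"]:
        failures.append("cost")   # catches the verbose-prompt case even when quality holds
    if pr["p95_latency_ms"] - base["p95_latency_ms"] > BUDGETS["p95_latency_delta_ms"]:
        failures.append("latency")
    return failures
```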

The fourth is the eval-as-test-of-the-eval. Teams optimize the prompt against the golden set until the golden set is the only thing it does well. The protection is a held-out evaluation set that the team does not look at during prompt iteration and that is reported in the PR comment as a separate, non-blocking metric. When the golden set improves and the held-out set degrades, the team has overfit to the eval and the comment will say so before production does.
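A small sketch of that non-blocking divergence check, with an arbitrary gap threshold:

```python
# Sketch: surface divergence between golden-set and held-out trends as a
# loud but non-blocking warning in the comment.
def overfit_warning(golden_delta: float, heldout_delta: float, gap: float = 0.03):
    # Golden improved while held-out degraded by more than the gap:
    # likely overfitting to the eval itself.
    if golden_delta > 0 and heldout_delta < 0 and golden_delta - heldout_delta > gap:
        return (f"golden set {golden_delta:+.3f} but held-out {heldout_delta:+.3f}: "
                "possible overfit to the golden set")
    return None
```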

What "Evals Gate Merges" Actually Looks Like

The end state isn't more dashboards. It's that the merge button — the same merge button engineers click on every other PR — won't enable until the eval comment shows a green delta on the affected slices, and the reviewer's eyes have already passed over a list of three to five concrete examples that newly pass or fail because of the change.

Concretely, the artifact is a sticky comment on every PR that touches a prompt or a chain config. The comment has four sections. A summary line: "evals on 4 affected slices, factuality +0.02, refusal-rate -0.01, cost +12%, latency +0ms." A per-slice table with deltas and pass/fail against thresholds. A blast-radius callout: "this prompt is invoked by 7 chains; ran their evals too — all green except support-triage which dropped 3 points on the urgent-ticket slice." And a regression-highlights block: three examples that newly fail, with input, expected, previous output, new output, and judge verdict, each linked to the full record in the eval platform.
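A sketch of assembling those four sections into a comment body; the Markdown layout and field names are one reasonable choice, not a prescribed format:

```python
# Sketch: render the sticky-comment body from the summary, per-slice table,
# blast-radius note, and regression highlights. All inputs are illustrative.
def render_comment(summary, slice_rows, blast_radius_note, regressions):
    lines = [f"**Eval report**: {summary}", ""]
    lines += ["| slice | base | PR | delta | status |", "|---|---|---|---|---|"]
    for r in slice_rows:
        status = "pass" if r["delta"] >= r["budget"] else "FAIL"
        lines.append(f"| {r['slice']} | {r['base']:.2f} | {r['pr']:.2f} | {r['delta']:+.2f} | {status} |")
    lines += ["", f"**Blast radius:** {blast_radius_note}", "", "**New failures:**"]
    for reg in regressions:
        lines.append(f"- `{reg['id']}`: {reg['verdict']} ([full record]({reg['url']}))")
    return "\n".join(lines)
```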

The reviewer reads this in fifteen seconds. The summary line lets them skip to the table; the table lets them skip to the slices that moved; the regression highlights make at least one bug visible per PR that has bugs. Approval is informed; rejection is specific. None of it requires the reviewer to leave the GitHub PR view, and none of it requires opening a dashboard.

The investment to get there is not in the eval framework — that part is commodity. The investment is in the four pieces above: per-PR scoping, blast-radius analysis, inline regression highlights, and a latency budget that keeps the comment fast. Build those, and "we have evals" turns into "evals gate merges, and the reviewer is faster, not slower." Skip them, and the eval action ships, the comment posts, and the team is back to rationalizing aggregate numbers two weeks before the slice-level regression hits a customer.

The lesson from coverage is that the metric matters less than the surface it lives on. A weak metric in the PR comment will outperform a perfect metric in a dashboard, every time. Move your evals into the comment. The rest is implementation detail.
