Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.

The pattern to copy is already in your repo. Code coverage made the same journey twenty years ago: from a nightly job whose HTML report nobody opened, to a per-PR delta posted as a sticky comment with file-by-file callouts. Most engineers cannot remember a time when coverage wasn't inline, and that's the point — once the feedback was adjacent to the change, the practice survived without anyone needing to be sold on it.

What "We Have Evals" Actually Means in Practice

Walk into any team that has been doing LLM work for more than a quarter and ask where the evals run. The answer is almost always the same: there's a script, it runs against a "golden dataset," and the results land in a notebook, a spreadsheet, or — if the team is feeling mature — a hosted dashboard. The script gets invoked manually before a release, or on a cron, or whenever a PM asks "did we improve the summarization?"

This setup makes evaluation a job — a thing you trigger, watch run, and interpret. Jobs have inertia. A job that takes ten minutes will not be triggered on every commit; a job that produces a hundred-row table will not be read on every commit; a job that lives outside the code review tool will not block a merge. The team has built a measuring instrument and then placed it three rooms away from the experiment.

The natural drift is predictable. The eval gets run on the days it gets run. PRs that touched a prompt get merged based on whether the engineer "felt good about the change" and whether reviewers could spot anything obviously wrong by reading the new template. The eval, when it does run, often regresses by two or three points on a slice nobody noticed, and the team doesn't connect the regression to the merge that caused it because the connection isn't visible in the tool where merges happen.
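To make the "slice nobody noticed" failure concrete, here is a minimal sketch of a per-slice comparison between a base-branch run and a head-branch run. The slice names, the 0-100 score scale, and the one-point tolerance are all illustrative assumptions, not the output format of any particular eval framework.

```typescript
// Hypothetical shape: mean score per slice of the golden dataset, on a 0-100 scale.
type SliceScores = Record<string, number>;

interface SliceDelta {
  slice: string;
  base: number;
  head: number;
  delta: number;
}

// Compare head-branch scores against base-branch scores, slice by slice.
// Slices missing from the base run are skipped rather than guessed at.
function sliceDeltas(base: SliceScores, head: SliceScores): SliceDelta[] {
  return Object.keys(head)
    .filter((slice) => slice in base)
    .map((slice) => ({
      slice,
      base: base[slice],
      head: head[slice],
      delta: head[slice] - base[slice],
    }))
    .sort((a, b) => a.delta - b.delta); // worst regression first
}

// A slice counts as regressed when it drops by more than `tolerance` points.
function regressedSlices(deltas: SliceDelta[], tolerance = 1.0): SliceDelta[] {
  return deltas.filter((d) => d.delta < -tolerance);
}

// Illustrative numbers: the mean over slices is 80 on both sides,
// so an aggregate-only report would show no change at all.
const baseScores: SliceScores = { short_docs: 90, long_docs: 80, non_english: 70 };
const headScores: SliceScores = { short_docs: 92, long_docs: 81, non_english: 67 };
const regressions = regressedSlices(sliceDeltas(baseScores, headScores));
```

With these made-up numbers the aggregate is flat while `non_english` drops three points, which is exactly the kind of change a weekly dashboard number hides and a per-PR slice table surfaces.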

This is not a tooling failure of the eval framework. The frameworks are fine — DeepEval, promptfoo, Braintrust, Latitude, Traceloop, and a dozen others can all produce competent metrics on a golden dataset. The failure is that the metrics are produced somewhere other than the surface where the merge decision is made.

What Code Coverage Got Right

Code coverage as a metric has always been weak. Hitting a line of code is not the same as testing it; 100% coverage on a function that returns the wrong type is still 100% coverage; teams that chase the number write tests that exercise lines without asserting on behavior. Despite all of that, coverage as a practice survived because of one specific thing: the report moved into the PR comment.

The progression matters. Early coverage tools produced HTML reports — a directory of files you opened in a browser, color-coded line by line. They were genuinely useful and almost nobody used them. The next step was the dashboard, hosted somewhere with a URL that decayed in the team wiki. Slightly more useful, still not load-bearing on the merge decision. The thing that made coverage stick was a sticky bot comment on every PR with three pieces of information: the overall coverage percentage, the delta versus the base branch, and a list of changed files where coverage dropped.

The third piece is the one that did the work. Knowing that overall coverage went from 87.2% to 87.1% is a fact you can rationalize away. Knowing that payments/refund.ts went from 94% to 71% on this PR specifically is a thing your reviewer is going to ask you about. The diff and the metric are co-located, and the metric is scoped to the change, not aggregated over the whole codebase.
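The scoping step can be sketched in a few lines. This is a simplified illustration, assuming coverage is available as a per-file percentage for both branches and that the list of changed files comes from the PR diff; the function name and the file paths other than payments/refund.ts are hypothetical.

```typescript
// Hypothetical input: line coverage percentage per file, for base and head.
type CoverageByFile = Record<string, number>;

interface FileDelta {
  file: string;
  base: number;
  head: number;
  delta: number;
}

// Report coverage drops only for files the PR actually touched.
// Files that dropped elsewhere in the codebase are deliberately ignored here.
function changedFileDrops(
  base: CoverageByFile,
  head: CoverageByFile,
  changedFiles: string[],
): FileDelta[] {
  return changedFiles
    .filter((file) => file in base && file in head)
    .map((file) => ({
      file,
      base: base[file],
      head: head[file],
      delta: head[file] - base[file],
    }))
    .filter((d) => d.delta < 0)
    .sort((a, b) => a.delta - b.delta); // largest drop first
}

const baseCov: CoverageByFile = {
  "payments/refund.ts": 94,
  "payments/charge.ts": 88,
  "search/index.ts": 80,
};
const headCov: CoverageByFile = {
  "payments/refund.ts": 71,
  "payments/charge.ts": 90,
  "search/index.ts": 78,
};
// search/index.ts also dropped, but this PR did not touch it.
const drops = changedFileDrops(baseCov, headCov, [
  "payments/refund.ts",
  "payments/charge.ts",
]);
```

The design choice worth noticing is the first filter: the reviewer only sees numbers they can attribute to the diff in front of them.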

Modern coverage actions tighten this further: they edit the same comment in place when you push new commits instead of stacking a new comment per push, they list only the files modified in the PR to keep the noise down, and they fail the check if the coverage drop on changed files exceeds a threshold. None of this is interesting individually. Together, these behaviors make coverage a thing that happens to your code review without anyone having to ask for it.
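Two of those behaviors, the single re-edited comment and the threshold check, reduce to small pure functions. The sketch below is one common way to do it, assuming the bot tags its comment with a hidden HTML marker so it can find it again; the marker string, type names, and function names are hypothetical, and the actual API calls to list, create, or update PR comments are left out.

```typescript
// Hypothetical hidden marker that identifies the bot's own comment.
const MARKER = "<!-- eval-report -->";

interface PrComment {
  id: number;
  body: string;
}

type CommentAction =
  | { kind: "create"; body: string }
  | { kind: "update"; id: number; body: string };

// Decide whether to edit the existing report comment or post the first one.
// The caller would carry out the returned action against the code host's API.
function planStickyComment(existing: PrComment[], report: string): CommentAction {
  const body = `${MARKER}\n${report}`;
  const previous = existing.find((c) => c.body.startsWith(MARKER));
  return previous
    ? { kind: "update", id: previous.id, body }
    : { kind: "create", body };
}

// The check passes only if no changed file (or eval slice) dropped
// by more than `maxDrop` points. Deltas are head minus base.
function checkPasses(deltas: number[], maxDrop: number): boolean {
  return deltas.every((d) => d >= -maxDrop);
}

const firstPush = planStickyComment([{ id: 1, body: "LGTM" }], "Coverage: 87.1%");
const secondPush = planStickyComment(
  [
    { id: 1, body: "LGTM" },
    { id: 2, body: `${MARKER}\nCoverage: 87.1%` },
  ],
  "Coverage: 87.4%",
);
```

Keeping the decision separate from the network call makes the behavior easy to test and keeps the PR thread at one report no matter how many commits are pushed.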

The Engineering Investment Behind a Useful PR Comment

Replicating that posture for LLM evals takes more than wiring up promptfoo eval or braintrust eval-action in a GitHub workflow. The wiring is the easy part — both ship an action that posts a comment. The hard part is making the comment carry information that survives a busy reviewer. Four pieces have to land.
