AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code
The PR-review dashboard has shown green for six weeks. Bot catch rate, comment volume, developer "thumbs up" reactions — all steady. Then a security incident lands in production and the post-mortem points at a missing null-check the bot used to catch and quietly stopped catching about two months ago. Nobody changed the bot. Nobody downgraded the model. The dashboard never moved. The standard moved.
This is the failure mode of automated code review that doesn't show up in any product demo. Teams adopt an LLM reviewer for the consistency win — every PR gets the same checklist, no senior engineer's bad-day variance, fast turnaround for junior contributors — and the consistency is real for about a quarter. Then the system prompt evolves, the model bumps, the few-shot library accumulates, and the bot is reviewing a different codebase against a different rubric using a different model than the one the team validated against. The team's mental model of "what the bot catches" decays into "what the bot caught last week."
The Reviewer Is Not a Single Artifact
It is tempting to think of an LLM code reviewer as one thing: the bot. In practice it is a stack of independently drifting components, and any of them can shift the standard without a release note.
The system prompt is the first surface. Most teams iterate on it weekly. Someone adds "be more concise" after a developer complains about wall-of-text comments. Someone adds "flag missing tests" after a near-miss. Someone removes a clause about coding style because the linter handles it now. Each edit is a small, locally reasonable change. The aggregate is a rubric that nobody has read end-to-end in two months.
The model is the second surface. Provider weight updates are now routine: OpenAI, Anthropic, and Google all ship behavior-affecting refreshes inside a single model ID without bumping the version string. Even pinned snapshots aren't truly pinned, because safety filters and server-side decoding defaults can change underneath them. Aliases like claude-sonnet or gpt-5 are worse: they auto-upgrade whenever the provider rolls a new snapshot, so the alias subscriber experiences every minor change as an unannounced production deployment with no rollout staging.
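Where the provider's response reports which snapshot actually served a request, this kind of drift is at least detectable per call. A minimal sketch, assuming an OpenAI-style chat API; the snapshot ID and the alert hook are illustrative, not prescriptive:

```python
# Sketch: detect a silent model swap behind an alias.
# Assumes an OpenAI-style client; EXPECTED and alert() are illustrative.
from openai import OpenAI

client = OpenAI()
REQUESTED = "gpt-5"            # the alias the team configured
EXPECTED = "gpt-5-2025-08-07"  # the snapshot the last eval run validated (illustrative)

def alert(msg: str) -> None:
    # Hypothetical hook: page someone, or just emit a metric the dashboard charts.
    print(f"[reviewer-drift] {msg}")

def review(diff: str) -> str:
    resp = client.chat.completions.create(
        model=REQUESTED,
        messages=[{"role": "user", "content": f"Review this diff:\n{diff}"}],
    )
    # The response carries the snapshot that actually served the request.
    if resp.model != EXPECTED:
        alert(f"asked for {REQUESTED}, served by {resp.model}, validated on {EXPECTED}")
    return resp.choices[0].message.content or ""
```

The check costs nothing per call, and it converts "the alias rolled" from an unannounced deployment into a logged event with a timestamp.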
The few-shot examples are the third surface. Someone added a perfect "good catch" example after the auth incident. Someone added a "do not flag this style" example after a developer pushed back on a nit. The examples shape behavior more aggressively than the instructions do, and they are usually checked into a different file than the system prompt — or worse, into a config UI that produces no diff in code review.
The retrieval-augmented context is the fourth surface. Modern review bots pull in repo conventions, past PR decisions, and code-owner preferences. The retrieval index is rebuilt nightly. The conventions document was edited last Tuesday. The behavior shifted on Wednesday.
The eval set the team uses to validate the bot is the fifth surface. And the eval set itself drifts, because the team that maintains it is the same team whose standards are shifting. You cannot use a moving ruler to detect that the ruler is moving.
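One low-tech defense falls out of this list: treat the five surfaces as a single versioned artifact and stamp every review with its fingerprint. A sketch, where the file paths and the retrieval index version field are assumptions about how a particular team stores things:

```python
# Sketch: fingerprint every drifting surface of the reviewer.
# The file paths and version fields below are placeholders for your own layout.
import hashlib
import json
from pathlib import Path

def _sha(path: str) -> str:
    p = Path(path)
    return hashlib.sha256(p.read_bytes()).hexdigest()[:12] if p.exists() else "missing"

def reviewer_fingerprint(model_snapshot: str, retrieval_index_version: str) -> dict:
    return {
        "system_prompt": _sha("prompts/reviewer.md"),
        "few_shots": _sha("prompts/few_shot_examples.json"),
        "model": model_snapshot,                     # the resolved snapshot, never the alias
        "retrieval_index": retrieval_index_version,  # e.g. the nightly build ID
        "eval_set": _sha("evals/review_corpus.jsonl"),
    }

# Attach this to every review comment's metadata. When the post-mortem asks
# "what was the bot on March 3rd?", the answer is a diff between two
# fingerprints instead of an archaeology project.
print(json.dumps(reviewer_fingerprint("gpt-5-2025-08-07", "idx-2026-02-14"), indent=2))
```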
Catch Rate Is the Wrong North Star
Most public benchmarks for AI code review tools report a single headline number: percent of seeded bugs caught. Recent independent evaluations spread the field from the low 40s to the low 80s, with Greptile around 82%, Bugbot and Copilot in the mid-50s, CodeRabbit at 44%, and Graphite at the low end on catch but the high end on signal-to-noise. The numbers move every few months.
The number that doesn't appear on the benchmark page is the trend: how much did the catch rate change since the last bot update, and on which categories did the change concentrate? The bot that catches 80% of bugs on average can be catching 95% of null-pointer issues and 40% of authorization issues, and a model bump can flip those numbers without changing the aggregate. A team monitoring only the headline number sees a flat line. A team monitoring per-category catch rates sees the auth-issue catch rate collapse two weeks before the incident.
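Computing the per-category view takes little more than a frozen corpus of seeded bugs tagged by category. A sketch, with the corpus format and the ten-point alert threshold as assumptions:

```python
# Sketch: per-category catch rates on a frozen seeded-bug corpus, diffed run over run.
from collections import defaultdict

def catch_rates(results: list[dict]) -> dict[str, float]:
    """results: [{"category": "authz", "caught": True}, ...] for every seeded bug."""
    caught, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["category"]] += 1
        caught[r["category"]] += r["caught"]
    return {cat: caught[cat] / total[cat] for cat in total}

def drift_report(prev: dict[str, float], curr: dict[str, float],
                 threshold: float = 0.10) -> None:
    for cat, old in prev.items():
        new = curr.get(cat, 0.0)
        if new - old <= -threshold:
            print(f"ALERT {cat}: {old:.0%} -> {new:.0%}")

# The per-category diff surfaces what the aggregate smooths over:
prev = {"null_deref": 0.95, "authz": 0.65, "injection": 0.80}
curr = {"null_deref": 0.97, "authz": 0.40, "injection": 0.82}
drift_report(prev, curr)  # ALERT authz: 65% -> 40%
```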
False-positive rate is the other half of the lie. Industry-standard ranges sit at 5–15%, and the arithmetic compounds quickly: a team reviewing 50 PRs a week at five comments per PR gets 250 bot comments, so a 10% false-positive rate means 25 false flags weekly, which adds up to over 20 hours of investigation time per month spent chasing the bot's noise. Catch rate trades off against false-positive rate, and a bot update that nudges one up usually nudges the other up too. Without measuring both on a stable corpus, the team has no way to know which way the bot's calibration moved.
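Measuring both on the same frozen corpus can be one small harness. A sketch, where the comment-to-bug match heuristic is a naive stand-in for whatever grading a team actually trusts:

```python
# Sketch: score catch rate and false-positive rate from one frozen, labeled corpus.
# Each item pairs the bot's comments on a PR with the bugs seeded into that PR.

def match(comment: str, bug_label: str) -> bool:
    # Naive stand-in: real pipelines use line-range overlap or a human/LLM grader.
    return bug_label.lower() in comment.lower()

def calibration(corpus: list[dict]) -> dict[str, float]:
    caught = seeded = false_flags = comments = 0
    for item in corpus:
        bugs, bot = item["true_bugs"], item["bot_comments"]
        caught += sum(any(match(c, b) for c in bot) for b in bugs)
        false_flags += sum(not any(match(c, b) for b in bugs) for c in bot)
        seeded += len(bugs)
        comments += len(bot)
    return {
        "catch_rate": caught / max(seeded, 1),
        "false_positive_rate": false_flags / max(comments, 1),
    }

corpus = [{
    "true_bugs": ["missing null check"],
    "bot_comments": ["possible missing null check on user", "rename this variable"],
}]
print(calibration(corpus))  # both halves of the calibration, from the same run
```

Tracking the pair run over run is what turns "the bot feels noisier lately" into a measured move in calibration.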
- https://www.greptile.com/benchmarks
- https://www.codeant.ai/blogs/ai-code-review-false-positives
- https://graphite.com/guides/ai-code-review-false-positives
- https://github.blog/ai-and-ml/github-copilot/60-million-copilot-code-reviews-and-counting/
- https://docs.github.com/en/copilot/concepts/agents/code-review
- https://eleks.com/expert-opinion/evaluating-github-copilot-code-review/
- https://www.langchain.com/articles/llm-as-a-judge
- https://blog.ml.cmu.edu/2025/12/09/validating-llm-as-a-judge-systems-under-rating-indeterminacy/
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://agenta.ai/blog/prompt-drift
- https://futureagi.com/blog/what-is-llm-drift-2026
- https://www.traceloop.com/blog/catching-silent-llm-degradation-how-an-llm-reliability-platform-addresses-model-and-data-drift
- https://stackpulsar.com/blog/llm-model-drift-detection/
- https://www.devtoolsacademy.com/blog/state-of-ai-code-review-tools-2025/
- https://arxiv.org/html/2505.20206v1
