When LLMs Review LLMs, Errors Get Laundered, Not Caught

Tian Pan
Software Engineer

Trace the path of a single quality signal through a modern AI pipeline. An agent drafts a response. A second model reviews it and scores it 9 out of 10. That score gets logged. At the end of the quarter, the logged scores become the new eval set, and the next model is tuned to do well against it. Now ask the obvious question: where in that loop did a human ever look at the actual output?

In a lot of pipelines, the honest answer is nowhere. The agent that does the work is reviewed by another agent, and that reviewer's verdict feeds the next round of evaluation. The loop is closed. It runs continuously, it produces a dashboard, and the dashboard is green. What it does not contain, at any point, is a measurement against reality.

This is not a hypothetical risk. It is where a fast-moving AI team tends to end up six months in. Nobody decided to remove the human; the human was sampled less and less often until the sampling rate hit zero, and no one noticed. The pipeline kept emitting numbers the whole time, so it never looked broken. That is the trap. A closed AI-reviews-AI loop does not fail loudly. It fails by quietly losing contact with the thing it was supposed to measure, while continuing to report that everything is fine.

A loop with no anchor measures only itself

Every measurement system needs a fixed reference. A scale needs a calibration weight. A clock needs a time standard. An eval pipeline needs ground truth — some set of judgments that come from outside the system being judged. The moment the reference point is itself generated by a model in the same loop, you no longer have a measurement. You have a system reporting on its own internal consistency.

Internal consistency is cheap to achieve and tells you almost nothing. Two models from the same family, asked to agree, will agree. They were trained on overlapping data, they share architectural priors, and they fail in correlated ways. When the reviewer says the worker's output is good, that sentence does not mean "this output is correct." It means "this output is the kind of thing a model like me would produce." Those are very different claims, and the dashboard collapses them into one number.

The structural problem compounds when the reviewer's scores are recycled as training or evaluation data. Research on training models on model-generated content describes model collapse: without real-data anchoring, recursive loops cause models to forget the tails of the distribution and drift toward their own projection of reality. The same dynamic applies to eval loops, not just training loops. An eval set assembled from a model judge's past verdicts inherits that judge's blind spots and then certifies the next model against them. The suite stops describing the world and starts describing the judge.

Agreement is not evidence

The most seductive mistake here is treating model agreement as validation. It feels rigorous: two independent systems looked at the output and both approved it. Surely two beats one.

It does not, because the two are not independent. Recent work on inter-model consensus is blunt about this: despite differences in architecture and training, large models share overlapping training data, which produces correlated errors and makes consensus look stronger than it is. Models reinforce each other's misconceptions rather than providing independent validation. When both reviewers miss the same bug, their agreement is not two confirmations. It is one blind spot, counted twice.
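A little arithmetic makes "one blind spot, counted twice" concrete. The sketch below is illustrative and not taken from the work mentioned above: it assumes two reviewers that each miss a given bug with the same probability, and whose misses are positively correlated. The function name, the parameters `p_miss` and `correlation`, and the example numbers are all hypothetical.

```python
def joint_miss_probability(p_miss: float, correlation: float) -> float:
    """Probability that two reviewers BOTH miss the same bug.

    Each reviewer's miss is modeled as a Bernoulli variable with rate p_miss.
    For two such variables with Pearson correlation rho:
        P(both miss) = p_miss**2 + rho * p_miss * (1 - p_miss)
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be in [0, 1]")
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be in [0, 1] for this sketch")
    return p_miss ** 2 + correlation * p_miss * (1.0 - p_miss)


# Hypothetical numbers: each reviewer alone misses 20% of bugs.
for rho in (0.0, 0.8, 1.0):
    both = joint_miss_probability(0.2, rho)
    print(f"correlation={rho:.1f} -> both miss {both:.3f}")
# correlation=0.0 -> both miss 0.040
# correlation=0.8 -> both miss 0.168
# correlation=1.0 -> both miss 0.200
```

Under these assumptions, an independent second reviewer cuts the miss rate from 20% to 4%. At a correlation of 0.8 the second reviewer only brings it down to about 17%, and at 1.0 it adds nothing.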

There is a second-order version of this when a model judges outputs from its own family. Studies on self-preference bias in LLM-as-a-judge found that models assign systematically higher scores to text that resembles their own generations, and traced the mechanism to perplexity: judges reward outputs they find familiar, meaning outputs to which they assign low perplexity, regardless of whether those outputs are actually better. A judge from the same family as the worker is not a neutral observer. It is biased toward exactly the outputs the worker is most likely to produce.

It is worth being precise about what agreement does and does not measure:

  • Reliability is consistency — the same input produces the same verdict. A loop of model reviewers can be extremely reliable.
  • Validity is correctness — the verdict matches reality. Reliability tells you nothing about validity.
  • A closed loop optimizes hard for the first and leaves the second completely unobserved.

Worse, the standard statistics imported from human evaluation quietly break here. Work on inter-rater reliability between LLM and human raters notes that agreement metrics were designed for raters who are independent and stable; an LLM judge is neither — it has its own randomness and its own correlated bias. A high agreement score between two model raters can mean they are both right, both wrong in the same direction, or both noisy in the same way. The number cannot distinguish those cases, so it should not be read as confidence.
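The gap between reliability and validity is easy to reproduce on toy data. The following is a minimal sketch with made-up labels: two hypothetical model judges agree perfectly with each other (raw agreement and Cohen's kappa both 1.0) while matching the human labels only 60% of the time. The helper names and the verdict lists are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_observed = agreement_rate(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # kappa is undefined when only one label is used
    return (p_observed - p_chance) / (1.0 - p_chance)


# Made-up verdicts for ten outputs.
judge_a = ["pass", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "pass"]
judge_b = list(judge_a)  # a same-family judge that shares judge_a's blind spots
human = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "fail", "pass", "fail"]

print(f"judge vs judge: agreement={agreement_rate(judge_a, judge_b):.2f}, "
      f"kappa={cohen_kappa(judge_a, judge_b):.2f}")
print(f"judge vs human: agreement={agreement_rate(judge_a, human):.2f}, "
      f"kappa={cohen_kappa(judge_a, human):.2f}")
# judge vs judge: agreement=1.00, kappa=1.00
# judge vs human: agreement=0.60, kappa=0.29
```

The judge-versus-judge numbers look perfect and say nothing about correctness. Only the comparison against the human labels reveals the four bad outputs that both judges passed.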

How a green pipeline launders an error

Here is the part that makes closed loops genuinely dangerous rather than merely weak: they do not just fail to catch errors. They actively convert errors into evidence of quality.

Walk the laundering step by step. The worker model produces an output with a subtle factual error. The reviewer model, sharing the same training data, does not recognize the error — it has the same gap. It scores the output highly. That score is written to a log. The log is later aggregated into a quality metric, and the metric is used to build the next eval set. By the time a human sees anything, the error has been transformed: it entered the system as a mistake and exited as a 9-out-of-10 data point that the team will now tune future models to reproduce.

Nothing in that chain was dishonest. Each step did its job. But the cumulative effect is that an error acquired a clean provenance. It got a paper trail of approvals. Anyone auditing the pipeline later sees a high score with a documented review behind it and has no reason to suspect it. The error did not get caught; it got laundered, and the laundering made it harder to catch than if there had been no review at all.
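One way to keep a laundered score from passing as ground truth is to record who produced each verdict and carry that field through every aggregation. The sketch below is an assumption about how such a log might look, not a description of any particular pipeline; `ScoredOutput`, `label_source`, and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Literal, Tuple


@dataclass(frozen=True)
class ScoredOutput:
    """One logged review. All field names are hypothetical."""
    output_id: str
    score: int  # reviewer score, for example 1..10
    label_source: Literal["human", "model"]  # who produced the score


def split_by_provenance(
    records: Iterable[ScoredOutput],
) -> Tuple[List[ScoredOutput], List[ScoredOutput]]:
    """Separate human-anchored records from model-only ones."""
    anchored: List[ScoredOutput] = []
    model_only: List[ScoredOutput] = []
    for record in records:
        (anchored if record.label_source == "human" else model_only).append(record)
    return anchored, model_only


def human_anchor_fraction(records: Iterable[ScoredOutput]) -> float:
    """Share of records whose verdict came from a human."""
    records = list(records)
    if not records:
        return 0.0
    anchored, _ = split_by_provenance(records)
    return len(anchored) / len(records)


log = [
    ScoredOutput("a1", 9, "model"),
    ScoredOutput("a2", 9, "model"),
    ScoredOutput("a3", 4, "human"),
    ScoredOutput("a4", 8, "model"),
]
ground_truth, unverified = split_by_provenance(log)
print([r.output_id for r in ground_truth])  # ['a3']
print(human_anchor_fraction(log))           # 0.25
```

With the source recorded, a 9 out of 10 from a model reviewer stays labeled as a model opinion when it reaches the eval set, and an auditor can see at a glance how much of the suite rests on human judgment.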

This is why "we added an AI reviewer" can make a system less safe rather than more. An unreviewed output is obviously unverified — everyone treats it with suspicion. A reviewed-and-approved output carries institutional confidence. If the review was performed by a model that shares the worker's blind spots, you have not added a safety check. You have added a credential, and bad outputs now wear it.

Cognitive monoculture is the root cause

The technical name for the underlying condition is cognitive monoculture. When the worker and the reviewer come from the same model family, they share an architecture, a training corpus, and therefore a shape of error. They are strong on the same things and weak on the same things. Stacking two of them does not widen coverage; it just deepens the existing groove.

The mitigation is diversity that is real and not cosmetic. Different model families, trained on different data with different objectives, produce genuinely different error distributions. Where one family is blind, another often is not. This is why heterogeneous review ensembles tend to outperform same-family ones on hard reasoning tasks: disagreement between unlike models is informative in a way that agreement between like models never is.

But notice the ceiling. Cross-family review widens coverage; it does not create an anchor. Three model families that all absorbed the same scraped internet share some blind spots no matter how different their architectures are. Heterogeneity buys you a better loop. It does not buy you a way out of the loop. For that, the reference point has to come from outside every model in the system — which means a human, at least sometimes.

Put the human back, but spend the human wisely

The fix is not to abandon model review and read everything by hand. Model reviewers are fast, cheap, and consistent, and at production volume they are the only thing that scales. The fix is to break the closed loop with a sampled, deliberate human checkpoint — enough of one to keep the loop honest, not so much that it stops scaling.

A few principles for where to spend that scarce human attention:

  • Sample for calibration, not coverage. You are not trying to review everything. You are trying to answer one question: does the model reviewer's verdict still match a human's? Mature teams review on the order of 5–10% of traffic for exactly this purpose. The output of that sample is not a list of caught bugs — it is an agreement rate between the model judge and the human, tracked over time.

  • Treat the agreement rate as the real health metric. Your dashboard should not lead with the model reviewer's average score. It should lead with how often that reviewer agrees with a human on the sampled set. When that number drifts down, your eval has decoupled from reality — regardless of how green the score looks.

  • Stratify the sample. Pure random sampling spends most of the human budget on easy, obviously-fine cases. Concentrate human review where it is most informative: low-confidence verdicts, high-impact flows, and cases where model reviewers disagree with each other. Disagreement between models is a free signal pointing you at the inputs worth a human's time.

  • Re-anchor the eval set on a schedule. An eval suite grown from model-generated verdicts decays. Periodically replace a fraction of it with freshly human-labeled cases drawn from current production traffic. This is the calibration weight going back on the scale.

  • Keep the reviewer out of the worker's family. If the worker is from one model family, the automated reviewer should be from another. That does not fix the loop, but it makes the loop's blind spots less correlated and buys time between human checkpoints.
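The stratified-sampling principle above can be sketched as a small selection function. This is one possible shape under stated assumptions, not a prescribed implementation: the `Verdict` fields, the 0.6 confidence threshold, and the 20% share reserved for plain random sampling are all hypothetical choices.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Verdict:
    """One reviewed item. Field names are hypothetical."""
    item_id: str
    judge_a_pass: bool
    judge_b_pass: bool
    confidence: float  # judge-reported confidence in [0, 1]
    high_impact: bool  # item belongs to a high-impact flow


def is_priority(v: Verdict, low_confidence: float = 0.6) -> bool:
    """Flag items where a human look is most informative."""
    judges_disagree = v.judge_a_pass != v.judge_b_pass
    return judges_disagree or v.confidence < low_confidence or v.high_impact


def select_for_human_review(
    verdicts: Sequence[Verdict],
    budget: int,
    random_share: float = 0.2,
    seed: int = 0,
) -> List[Verdict]:
    """Spend most of the budget on flagged items, the rest on a random slice."""
    if budget <= 0:
        return []
    rng = random.Random(seed)
    priority = [v for v in verdicts if is_priority(v)]
    ordinary = [v for v in verdicts if not is_priority(v)]
    # Reserve a slice for ordinary traffic so easy cases are still spot-checked.
    reserved = min(len(ordinary), int(budget * random_share))
    priority_slots = min(len(priority), budget - reserved)
    # Any slots the priority pool cannot fill go back to random sampling.
    random_slots = min(len(ordinary), budget - priority_slots)
    return rng.sample(priority, priority_slots) + rng.sample(ordinary, random_slots)


verdicts = [Verdict(f"item-{i}", True, True, 0.9, False) for i in range(15)] + [
    Verdict("d1", True, False, 0.9, False),   # judges disagree
    Verdict("d2", False, True, 0.9, False),   # judges disagree
    Verdict("low", True, True, 0.3, False),   # low confidence
    Verdict("pay1", True, True, 0.95, True),  # high-impact flow
    Verdict("pay2", True, True, 0.95, True),  # high-impact flow
]
sample = select_for_human_review(verdicts, budget=6)
print(len(sample), sum(is_priority(v) for v in sample))  # 6 5
```

Because the sample is deliberately skewed toward hard cases, an agreement rate computed over all of it will understate agreement on ordinary traffic. Reporting the rate separately for the flagged and the random slices is one way to keep the two readings apart.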

The mental model worth carrying: a model reviewer is a fast instrument, and every instrument drifts. The human sample is not there to do the reviewing — it is there to recalibrate the instrument. You would not trust a scale that was never checked against a known weight, however precise its readings. An AI review pipeline deserves exactly the same skepticism.
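Recalibration needs a number to watch. As a rough sketch, the judge-versus-human agreement rate can be computed per review period and compared against an earlier baseline. The window sizes, the 0.05 tolerance, and the weekly figures below are hypothetical, and a production check would likely also account for sample size.

```python
from statistics import mean
from typing import Sequence


def judge_human_agreement(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of sampled items where the model judge matched the human."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def has_drifted(
    rates: Sequence[float],
    baseline_periods: int = 4,
    recent_periods: int = 2,
    max_drop: float = 0.05,
) -> bool:
    """True when recent agreement fell more than max_drop below the baseline."""
    if len(rates) < baseline_periods + recent_periods:
        raise ValueError("not enough history to compare baseline and recent windows")
    baseline = mean(rates[:baseline_periods])
    recent = mean(rates[-recent_periods:])
    return (baseline - recent) > max_drop


# One period's sample: the judge matched the human on 3 of 4 items.
print(judge_human_agreement([True, True, False, True], [True, False, False, True]))  # 0.75

# Hypothetical weekly agreement rates.
stable = [0.92, 0.91, 0.93, 0.92, 0.92, 0.91]
slipping = [0.92, 0.91, 0.93, 0.92, 0.90, 0.84, 0.82]
print(has_drifted(stable))    # False
print(has_drifted(slipping))  # True
```

The alert fires on the agreement rate, not on the judge's average score, so it can trip while the score dashboard is still green.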

The question to ask your pipeline

Closed-loop quality assurance is appealing because it looks like rigor and costs almost nothing. Outputs get reviewed, scores get logged, dashboards stay green, and no human has to be in the room. But a loop with no external reference is not measuring quality. It is measuring its own agreement with itself, and dressing that up as confidence.

So ask the concrete question of any AI evaluation pipeline you own: where does the ground truth come from? Trace it back. If every link in the chain — the worker, the reviewer, the eval set, the next model — is a model, and no human judgment enters anywhere, then your green dashboard is not telling you the system works. It is telling you the system agrees with itself. Those are different facts, and only one of them is worth shipping on.
