The Self-Correction Loop That Shared Its Verifier's Blind Spot

June 2, 2026 · 10 min read

Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.

The verifier sees what the generator sees

The first instinct, when an agent's output is wrong in a recoverable way, is to ask the model to check its own work. The second instinct, when that check passes a flawed answer through, is to ask the model to check the check. This stack of reflexive verifications feels like it should converge on something true. It usually does not.

A 2025 benchmark on self-correction in LLMs measured what happens when the same model both generates and critiques. Across fourteen models, the average rate at which a model failed to flag identical errors in its own output — errors it would have flagged in a user's input — was 64.5%. The failure is not random. It is systematic. The model's failure modes on generation map cleanly onto its failure modes on evaluation, because both are draws from the same underlying distribution of what the model thinks "good" looks like.

This is why a self-correction loop can iterate confidently to a wrong answer. Iterations 2 through 10 do not add information. They polish the prose, tighten the structure, hedge the claims that the generator was already uncertain about, and leave the load-bearing error untouched. The verifier never sees it, because the verifier is the same circuit that produced it.

The decomposition of self-correction into "confidence capability" (defending correct answers under challenge) and "critique capability" (turning wrong answers into right ones) is useful here. Most loops in production tune for the first and assume the second comes free. It does not. Increasing iteration count raises Expected Calibration Error: the model becomes more confident as it iterates, not more correct.

Convergence by exhaustion is not convergence

When a loop "converges" because the score stopped moving, the team running the loop should be asking what the score is measuring. If the score is the verifier's confidence that the answer is good, and the verifier shares the generator's blind spots, then a stable score is consistent with both "the answer became correct" and "the model ran out of ways to express its uncertainty." There is no way to distinguish those two states from inside the loop.

Production traces of agents stuck in this regime look distinctive. The same tool gets called with the same arguments, slightly rephrased. The score oscillates in a narrow band below the success threshold. The token cost climbs linearly. The reasoning steps repeat with synonym substitution. Latency grows. State does not change. The agent is doing work; the work is not productive. A Reflexion-style loop running ten cycles can consume 50× the tokens of a single linear pass, and the marginal value past round two is small enough that teams who instrument carefully often find rounds 3–10 are pure cost.

What makes this expensive is the budgeting structure most agents inherit. The cost is rebilled on every iteration because the full conversation history is replayed into the next prompt. A 20-step loop ends up consuming over 10× the tokens a simple per-step estimate would suggest. The bill scales quadratically with depth. A loop that "almost converges" for a long time costs more than a loop that fails fast.
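The arithmetic behind the 10× figure can be sketched under a simplifying assumption: every step appends the same number of tokens, and the full history is replayed as input on each call. The function names and the 1,000-token step size are illustrative, not measurements from a real system.

```python
def naive_estimate(steps: int, tokens_per_step: int) -> int:
    """Per-step estimate: each step's tokens are billed exactly once."""
    return steps * tokens_per_step


def replayed_cost(steps: int, tokens_per_step: int) -> int:
    """Step k re-sends the k chunks accumulated so far, so the total is
    tokens_per_step * (1 + 2 + ... + steps) = tokens_per_step * steps * (steps + 1) / 2.
    """
    return tokens_per_step * steps * (steps + 1) // 2


if __name__ == "__main__":
    steps, per_step = 20, 1_000  # assumed values for illustration
    ratio = replayed_cost(steps, per_step) / naive_estimate(steps, per_step)
    print(f"{steps} steps: {ratio:.1f}x the naive estimate")
```

Under this model the ratio is (n + 1) / 2, which is 10.5 for a 20-step loop. Prompt caching or history truncation would change the constant, but not the shape of the curve while the history keeps growing.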

"Let it try again" is a budget bomb

The instinct to give an agent room to retry is reasonable. Most one-shot LLM calls plateau around 60–70% accuracy on hard tasks. Iteration is what closes the gap to enterprise-grade reliability. The mistake is treating "iteration is good" as "more iteration is better." The return curve flattens fast, and the cost curve does not.

A few patterns recur in the loops that bleed budget:

No bounded retry policy. The agent retries until something else stops it — a timeout, a token cap, an external watchdog. The agent itself has no notion of giving up. There is no internal signal that this attempt is going nowhere.
No tool+args deduplication. The agent calls the same function with the same inputs and gets the same output and reasons over the same output again. Each pass adds latency and cost; none of them add information.
A convergence criterion that the loop controls. The threshold the agent is trying to clear is set by the same model deciding whether it has been cleared. The agent gets to grade its own homework, and any score short of the threshold is treated as a license to retry.
No staleness check. The agent doesn't notice that its last three attempts looked the same. There is no comparison between iteration N and iteration N-2 to detect that the loop is no longer changing state.

Each of these is fixable in isolation. None of them are typically fixed at design time, because the failure mode they prevent — burning a five-figure token bill on a single task — only surfaces in incidents, and most teams attribute the incident to "the agent got stuck" rather than to the architecture that let it.
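Two of the gaps listed above, tool+args deduplication and the staleness check, are small enough to sketch with the standard library. This is a minimal illustration, not a reference implementation: `LoopGuard`, its method names, and the 0.95 threshold are hypothetical, and the tool arguments are assumed to be JSON-serializable.

```python
import json
from difflib import SequenceMatcher


class LoopGuard:
    """Tracks tool calls and outputs within one task (hypothetical helper)."""

    def __init__(self, stale_threshold: float = 0.95):
        self.seen_calls: set[str] = set()
        self.outputs: list[str] = []
        self.stale_threshold = stale_threshold  # assumed value; tune per task

    def is_duplicate_call(self, tool: str, args: dict) -> bool:
        """True if this exact tool+args pair was already issued in this task."""
        # sort_keys makes the key independent of argument order.
        key = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        if key in self.seen_calls:
            return True
        self.seen_calls.add(key)
        return False

    def is_stale(self, output: str) -> bool:
        """Record the output of iteration N and compare it with iteration N-2."""
        self.outputs.append(output)
        if len(self.outputs) < 3:
            return False
        ratio = SequenceMatcher(None, self.outputs[-3], output).ratio()
        return ratio >= self.stale_threshold
```

Exact-match deduplication will not catch arguments that are merely rephrased; that case needs a similarity measure on the arguments as well, which this sketch leaves out.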

The discipline of bounded self-correction

The pattern that holds up under production load has three ingredients. They are not novel; they are just frequently skipped.

A retry budget the agent cannot extend. The maximum number of self-correction iterations is fixed at the orchestration layer, not negotiated by the agent itself. Two or three rounds is a reasonable default. The research consistently shows that 75% of the improvement available from self-correction lands in the first two iterations; the rest is a tax. When the budget is exhausted, the loop exits — not into "try again with a different prompt," but into the escalation path described below.
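A budget fixed at the orchestration layer might look like the following sketch. `run_bounded`, `LoopResult`, and the `generate` and `verify` callables are hypothetical names; the defaults of three rounds and a 0.9 threshold are assumptions chosen to match the discussion, not recommendations from a specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    accepted: bool
    output: str
    scores: list[float] = field(default_factory=list)


def run_bounded(
    generate: Callable[[Optional[str]], str],      # feedback -> new draft
    verify: Callable[[str], tuple[float, str]],    # draft -> (score, objection)
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> LoopResult:
    """Run at most max_rounds generate/verify cycles.

    The loop bound lives here, in the orchestrator; nothing the agent
    produces can raise it.
    """
    scores: list[float] = []
    feedback: Optional[str] = None
    output = ""
    for _ in range(max_rounds):
        output = generate(feedback)
        score, feedback = verify(output)
        scores.append(score)
        if score >= threshold:
            return LoopResult(True, output, scores)
    # Budget exhausted: return the last draft unaccepted so the caller escalates.
    return LoopResult(False, output, scores)
```

The caller is expected to treat `accepted=False` as a terminal state and hand the task off, rather than wrapping the call in another retry.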

A heterogeneous verifier. The model evaluating the output should not share the generator's blind spots. The cheapest version of this is using a different model family — Claude generates, GPT or Gemini judges, or vice versa. The more rigorous version is a dedicated verifier that was trained or tuned on a different objective: a process reward model, a code-execution sandbox that actually runs the output, a unit test the verifier did not write. Multi-agent verification frameworks scale this further, combining several verifiers whose individual blind spots are uncorrelated enough that their joint judgment is meaningful. The point is that the verifier must be epistemically independent of the generator. If it isn't, it isn't a verifier; it is a confidence amplifier.
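One way to picture combining independent verifiers is a function that accepts a candidate only when every check passes and collects the objections otherwise. The sketch below uses two deterministic checks on a JSON output as stand-ins; `Verdict`, `parses_as_json`, `has_keys`, and `joint_verdict` are hypothetical names, and a judge from a different model family would plug in as one more callable with the same signature.

```python
import json
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    verifier: str
    passed: bool
    objection: str


Verifier = Callable[[str], Verdict]


def parses_as_json(candidate: str) -> Verdict:
    """Deterministic check: no model involved, so no shared blind spot."""
    try:
        json.loads(candidate)
    except json.JSONDecodeError as exc:
        return Verdict("json_parser", False, str(exc))
    return Verdict("json_parser", True, "")


def has_keys(required: set[str]) -> Verifier:
    """Build a verifier that requires a JSON object containing these keys."""
    def check(candidate: str) -> Verdict:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            return Verdict("required_keys", False, "not valid JSON")
        missing = required - set(data) if isinstance(data, dict) else required
        if missing:
            return Verdict("required_keys", False, f"missing keys: {sorted(missing)}")
        return Verdict("required_keys", True, "")
    return check


def joint_verdict(candidate: str, verifiers: list[Verifier]) -> tuple[bool, list[str]]:
    """Accept only if every verifier passes; return the objections otherwise."""
    verdicts = [verify(candidate) for verify in verifiers]
    objections = [f"{v.verifier}: {v.objection}" for v in verdicts if not v.passed]
    return (not objections, objections)
```

Requiring unanimity is the conservative choice. A quorum rule is an alternative when individual verifiers are noisy, at the cost of letting one verifier's blind spot through.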

An explicit handoff when the budget runs out. The loop's failure case is not "raise an exception and crash." It is "produce an escalation payload that a human can act on without re-deriving the agent's state." That payload includes the original task, every attempted output, every score, the verifier's stated objection, and the agent's own summary of why it could not converge. The handoff happens at the point where the budget is exhausted, not at the point where the agent has gone silent for an hour. Escalation is a feature of the design, not the catastrophe at the end of one.
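The payload fields named above map naturally onto a small data structure. The following is a sketch with hypothetical class and field names; the `reason` default and the JSON serialization are assumptions about how a handoff queue might consume it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Attempt:
    output: str       # what the agent produced on this round
    score: float      # what the verifier returned
    objection: str    # the verifier's stated objection


@dataclass
class EscalationPayload:
    task: str                                   # the original task, verbatim
    attempts: list[Attempt] = field(default_factory=list)
    agent_summary: str = ""                     # agent's account of why it did not converge
    reason: str = "retry_budget_exhausted"      # assumed label for the terminal state

    def to_json(self) -> str:
        """Serialize everything a human needs without re-deriving agent state."""
        return json.dumps(asdict(self), indent=2)
```

A short usage example:

```python
payload = EscalationPayload(
    task="Summarize the incident report",
    attempts=[Attempt("draft v1", 0.78, "timeline is ambiguous")],
    agent_summary="Verifier objection changed each round; score stayed flat.",
)
print(payload.to_json())
```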

Detecting the loop before the bill arrives

The teams that catch self-correction loops in production before the post-mortem stage usually instrument three signals. None of them are new; all of them are absent from most agent dashboards.

Spans that repeat with high textual similarity within a single trace are a strong signal. Two consecutive agent steps whose output embeddings have a cosine similarity of 0.9 or higher, paired with no state change in the tool calls between them, almost always indicate a loop that is no longer making progress. A simple alert on this pattern catches most cases before the trace runs to its timeout.
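The alert condition can be sketched as a pure function over two consecutive steps. The embeddings are assumed to come from whatever embedding model the team already runs, and the state fingerprint is assumed to be any string that changes when a tool call changes state (for example a hash of tool results); both are inputs here, not something this sketch computes.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_stalled(
    prev_embedding: Sequence[float],
    curr_embedding: Sequence[float],
    prev_state: str,
    curr_state: str,
    threshold: float = 0.9,
) -> bool:
    """Flag two consecutive steps that say nearly the same thing and changed nothing."""
    no_state_change = prev_state == curr_state
    similar = cosine_similarity(prev_embedding, curr_embedding) >= threshold
    return no_state_change and similar
```

Requiring both conditions keeps the alert quiet for steps that legitimately restate a conclusion after a tool call changed the underlying state.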

Token cost per trace is a lagging but reliable signal. A trace that costs 10× the median for its task type is either solving an unusually hard instance or stuck in a loop. Bucketing traces by task type and flagging cost outliers turns the bill itself into a detector. The traces that look expensive are the ones that need a human eye.
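Bucketing by task type and flagging outliers takes only a few lines. The record layout below (`trace_id`, `task_type`, `tokens`) is an assumed shape for illustration; real trace stores will differ.

```python
from collections import defaultdict
from statistics import median


def flag_cost_outliers(traces: list[dict], multiple: float = 10.0) -> list[str]:
    """Return ids of traces costing at least `multiple` times their task type's median."""
    tokens_by_type: dict[str, list[int]] = defaultdict(list)
    for trace in traces:
        tokens_by_type[trace["task_type"]].append(trace["tokens"])
    medians = {task: median(costs) for task, costs in tokens_by_type.items()}
    return [
        trace["trace_id"]
        for trace in traces
        if trace["tokens"] >= multiple * medians[trace["task_type"]]
    ]
```

In small buckets the outlier itself pulls the median upward, so a median computed over a trailing window of completed traces is likely a better baseline than one computed over the batch being checked.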

Verifier score trajectory is the most direct signal but the hardest to interpret. A score that climbs monotonically toward the threshold is converging. A score that oscillates in a narrow band is not. The pattern recognition is straightforward; the question is whether the orchestrator looks at the trajectory at all, or just at whether the threshold was eventually hit. Most do not.
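A trajectory check the orchestrator could run after each round might look like this sketch. The labels, the three-round window, and the 0.1 band width are assumptions for illustration; the distinction that matters is between a score that keeps rising and one that moves inside a narrow band below the threshold.

```python
def classify_trajectory(
    scores: list[float],
    threshold: float,
    band: float = 0.1,
    min_rounds: int = 3,
) -> str:
    """Label a verifier score history as converged, climbing, oscillating, or erratic."""
    if scores and scores[-1] >= threshold:
        return "converged"
    if len(scores) < min_rounds:
        return "undetermined"  # too few rounds to judge the shape
    recent = scores[-min_rounds:]
    if all(later > earlier for earlier, later in zip(recent, recent[1:])):
        return "climbing"      # strictly rising toward the threshold
    if max(recent) - min(recent) <= band:
        return "oscillating"   # stuck in a narrow band below the threshold
    return "erratic"
```

With a threshold of 0.9, a history such as `[0.78, 0.84, 0.80, 0.82]` is labeled `"oscillating"`, which is the point at which an orchestrator could exit the loop instead of waiting for the timeout.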

What the loop is actually for

The deeper question is what self-correction is supposed to accomplish. If the goal is "make a wrong answer right," the loop only delivers when the verifier can see the wrongness — which is the case the generator's blind spot specifically rules out. If the goal is "make a marginal answer better," then two rounds is enough and the rest is theater. If the goal is "increase the agent's confidence that it can ship," then the loop is doing exactly what its architecture says: amplifying confidence without adding evidence. None of these are bad goals. They just require different machinery, and "let the agent verify and retry" is the machinery for only one of them.

The agent that converges on a wrong answer with high confidence is doing what its design rewarded it for doing. The fix is not a better prompt or a smarter model. It is an external verifier that does not share the generator's failure modes, a retry budget set by the orchestrator instead of by the loop's own opinion, and a handoff path that treats human escalation as a normal terminal state rather than an emergency. The teams that ship these three together stop having self-correction incidents. The teams that ship two of them keep having a smaller version of the same incident, slightly later, at slightly higher cost.

The loop that runs forever isn't broken. It is exactly as broken as the model's blind spot, exactly as expensive as the budget allowed, and exactly as confident as the verifier wanted it to be. The architecture chose all three.
