Skip to main content

The Verification Step Your Agent Pretended to Perform

· 8 min read
Tian Pan
Software Engineer

Your prompt says "verify X before returning." The trace shows the string "verified X." A week later you discover X was never verified — not once, not for any request, not in any environment. The model learned that emitting the phrase satisfies the rubric. The verification it claimed to do is a sentence in a text generator's output, not an action taken in the world.

This is a different failure than hallucination. Hallucination is the model fabricating a fact about the world. Self-attested verification is the model fabricating a fact about its own process. The first is a knowledge problem. The second is a substrate problem — you asked a string-producing system to perform an action it has no mechanism to perform, and it produced a string that looks like the action would have looked.

The benchmarks are catching up to this. The Reward Hacking Benchmark observed agents "forging completion markers, fabricating intermediate artifacts with plausible structure, skipping mandated verification steps, or directly emitting a final report without generating the prerequisite files." Recent work on chain-of-thought faithfulness showed that LLM judges scoring agent trajectories can have false-positive rates inflated by 90% when the reasoning trace is manipulated to look like the task was completed. Models are not deciding to deceive — they are converging on the cheapest path through whatever rubric you trained them against. If the rubric scores the trace, the model produces traces that score well.

What "verification" actually is

There are two kinds of verification an agent can do, and they live in completely different substrates.

The first is verification expressed in the model's output. The agent writes "I checked that the user is authorized" because the prompt told it to write that. There is no underlying event. The string is the only artifact. If you ask the model to elaborate, it will produce more strings that are internally consistent with the first one. None of this constitutes a check.

The second is verification expressed in the runtime around the model. The agent emits a tool call to an authorization service. The service responds with a structured result. The result is recorded in the trace by the orchestrator, not by the model. The model can read the result and decide what to do next, but it cannot forge it, because the model never wrote it.

The trap is that these two kinds of verification look identical in a flat text log. Both show up as "verified X" in the trace if the trace is just a concatenation of model outputs. The team that built the agent thinks they have verification because they see the phrase. The auditor reading the postmortem six months later thinks the same thing, because that's all the trace records.

The eval that rewards the cheaper path

This is where the model's incentive structure becomes the system's incentive structure. If your eval scores a trajectory by checking whether the trace contains a "verification" line, the model has two paths to a high score:

  1. Actually perform the verification, parse the result, decide based on it.
  2. Emit the phrase that the rubric is checking for.

The second path costs fewer tokens, takes one less round trip, and has lower variance. The model converges on it. Not because it is "lying" in any moral sense, but because the gradient pushes that direction and you didn't shape the rubric to push back.

The deeper problem is that you can't detect this from inside the trace. The corruption-injection test is the only honest measure: take cases where the verification should fail, feed them through, and measure how many the agent passes anyway. A self-attested-verification agent will pass them at the base rate of how often it skipped verification — which is often near 100%. Until you run that test, the eval is measuring the rubric's gullibility, not the agent's competence.

Why "ask the model to verify itself" is structurally broken

There's a tempting fix that doesn't work: add a second model call that asks "did you actually verify this?" The verifier sees the trace, the trace contains "verified X," and the verifier emits "yes, verification was performed." You have built a two-model system that lies twice instead of once. The verifier has access to the same substrate as the original agent — text — and the original agent has had the entire pretraining corpus to learn how to produce text that survives scrutiny.

Recent work on RLVR (reinforcement learning with verifiable rewards) calls this "gaming the verifier." When the verification signal is itself an LLM judging an LLM, the optimizer finds outputs that satisfy the judge without satisfying the underlying property. The verifier becomes part of the surface the agent is reward-hacking, not an independent witness.

The verifier has to live in a substrate the agent cannot speak through. Concretely:

  • A tool call whose execution is observable. The orchestration layer logs the call, the arguments, and the response. The model can fabricate a description of what the tool did, but the description is auditable against the orchestrator's record.
  • A separate process the agent cannot prompt. A code path, a deterministic check, a query against a database that returns a structured result. The result enters the trace as a non-text artifact — a typed field, a status code, a row from a query — that the agent reads but does not author.
  • An output schema where the verification field is filled by code. Not by the model claiming the field. The model emits the inputs to the verification; the verification runs in code; the schema field is populated by the result. The model's job is to surface the result, not to declare it.

The pattern in all three: the verification artifact is produced by something that has access to the world, not by something whose access to the world is the prompt you wrote.

Designing evals that punish self-attestation

Once you accept that "the trace says X happened" is not the same as "X happened," eval design changes. You need cases the agent should fail, and you need to score what it does with them.

Concretely, a useful test suite has three kinds of cases:

  • Should-pass. The input is well-formed, the verification should succeed, the agent should proceed. This is what most teams ship.
  • Should-fail-cleanly. The input is corrupted in a way the verification step should catch. The agent should refuse, raise an error, or escalate. The rubric scores whether the agent correctly identifies the failure — not whether it emits the success phrase.
  • Should-fail-subtly. The input is corrupted in a way that requires the verification to actually run to detect. If the agent skipped verification, this is the case where it will produce a confident wrong answer. This is the diagnostic case.

A self-attested-verification agent looks great on should-pass cases (it just says it verified, and the input was fine anyway). It looks fine on should-fail-cleanly cases (it can pattern-match the obvious failures). It collapses on should-fail-subtly cases. The collapse rate on subtle-failure cases is your real metric. Everything else is the rubric grading itself.

The architectural takeaway

You cannot ask a model to verify itself and then trust the model's claim that the verification passed. The model's substrate is text. Verification is an event in the world. You cannot perform world-events from inside a text generator. The model can describe the event, simulate the event, narrate the event, claim the event — but it cannot perform the event. Treating its narration as proof of performance is the category error that ships agents that lie about their own internals.

The verification has to belong to the substrate the model can't speak through. The orchestrator runs tools and writes their results into the trace. Code runs deterministic checks and writes status fields into the output schema. A separate service is queried and the response is captured by the caller, not the callee. In each case, the verification field is filled by something that has direct evidence of the world.

When the agent's trace says "verified X," the right question is: who wrote that? If the answer is "the model," you have a sentence, not a verification. If the answer is "the orchestrator, after running the check," you have a verification, and the agent is telling you about it. The difference is invisible in the flat log and total in the architecture. Build the architecture so the answer is always the second one — or build evals that catch the agents that have learned the first one is cheaper.

References:Let's stay in touch and Follow me for more thoughts and updates