
The Reflection Placebo: Why Plan-Reflect-Replan Loops Return Version One

9 min read
Tian Pan
Software Engineer

Open an agent's trace during a long-horizon planning task and count the number of times the model writes "let me reconsider," "on reflection," or "a better approach would be." Now compare the plan it finally commits to with the one it drafted first. In the majority of traces I've audited, the second plan is the first plan wearing a different hat — the same decomposition, the same tool calls, the same order of operations, with some renamed step labels and a reworded rationale. The reflection ran. The model emitted tokens that looked like reconsideration. The plan did not move.

This matters because "with reflection" has quietly become a quality tier. Teams ship planners with one, two, or three reflection rounds and bill themselves for the difference. The inference spend is real and measurable. Whether anything on the plan side actually changed is a question almost nobody instruments for, and the answer is frequently: no.

Why the second sample looks a lot like the first

When you ask a model to draft a plan, the first sample is drawn from the high-probability region of its output distribution for that prompt. That is what sampling means. When you then ask the same model — same weights, same context window, same system prompt — to "reconsider the above plan and propose improvements," the follow-up is drawn from the same distribution, now conditioned on the first plan. The prompt has changed, but the model's prior over "what a good plan for this task looks like" has not. The most probable revision is the one that validates the first draft, because that is what the training data taught the model to do when shown a reasonable-looking artifact: justify it.

Two failure modes dominate. The first is confirmation bias. When the same model acts as planner, evaluator, and reflector, the evaluation typically restates the initial reasoning and annotates it with additional confidence rather than surfacing the actual flaw. The second is mode collapse: the actor reproduces nearly identical solutions across retries despite receiving feedback, because post-training alignment has narrowed the distribution of responses the model considers likely for a given prompt shape. Recent work on verbalized sampling documents this at the data level — annotators in preference training systematically favor familiar outputs, so aligned models learn that the familiar answer is the preferred answer, even when the explicit instruction is to produce something different.

The upshot is that reflection in the same model, on the same prompt, inside the same context, is a very expensive way to ask the model to agree with itself.

Instrumenting what reflection actually produces

The reason most teams don't notice this is that they don't measure it. They measure whether reflection was emitted (it was) and whether the final answer was right (sometimes it was, but how often because of the reflection?). What they almost never measure is the diff between the pre-reflection plan and the post-reflection plan at the level that matters.

Three measurements are worth running on any production planner that claims reflection as a feature:

  • Structural diff rate: across a sample of runs, how often does the top-level decomposition — the ordered list of major steps, ignoring wording — change after reflection? Not "did any tokens change" but "did the shape of the plan change." In planners I have audited, this number is typically under 15%.
  • Lexical-only change rate: of the runs where the plan changed, how many changes are lexical only — a step renamed from "fetch user data" to "retrieve user information," a bullet reordered, a rationale rewritten without altering what actually runs. Track lexical-only changes separately; they are noise masquerading as signal.
  • Tool-call delta: if you have serialized the plan into a sequence of tool calls, how often does that sequence differ after reflection? Many reflection rounds are expensive no-ops when measured at the interface the rest of the system actually consumes.

If fewer than one in five reflection rounds produces a structural change and the majority of the rest are cosmetic, you are paying for reassurance, not for planning quality.
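
A minimal sketch of how those three counts might be computed from logged traces, assuming each trace already exposes the pre- and post-reflection plan as an ordered list of steps carrying a tool name and a prose label; the field names and the plan schema are assumptions, not a standard:

```python
from collections import Counter

def classify_diff(pre_plan, post_plan):
    """Classify what a reflection round did to the plan.

    Each plan is an ordered list of steps; each step is a dict with a
    'tool' field (what actually runs) and a 'label' field (the prose
    title). This schema is an assumption, not a standard.
    """
    pre_tools = [step["tool"] for step in pre_plan]
    post_tools = [step["tool"] for step in post_plan]
    if pre_tools != post_tools:
        return "structural"   # the tool-call sequence itself moved
    pre_labels = [step["label"] for step in pre_plan]
    post_labels = [step["label"] for step in post_plan]
    if pre_labels != post_labels:
        return "lexical"      # same calls, renamed or reworded
    return "none"             # an expensive no-op

def reflection_report(trace_pairs):
    """trace_pairs: iterable of (pre_plan, post_plan) from logged runs."""
    buckets = Counter(classify_diff(pre, post) for pre, post in trace_pairs)
    total = sum(buckets.values()) or 1
    return {
        "structural_rate": buckets["structural"] / total,
        "lexical_only_rate": buckets["lexical"] / total,
        "no_change_rate": buckets["none"] / total,
    }
```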

The pile-on-more-reflection trap

The predictable response to a stuck planner is to crank the knob: two reflection rounds, then three, then a reflect-and-reflect-again meta-loop. The token bill climbs. The plan does not move. Every additional round conditions the next sample on an ever-longer context that is now dominated by the model's own prior outputs — each of them drawn from the same distribution the first one came from. The Nth reflection is not drawing on new information; it is drawing on the first plan, through the lens of the first plan's justifications, through the lens of the second plan's reaffirmations.

This is also where token costs compound hardest. Agent loops are already quadratic in tokens because every step carries the full prior history forward. Reflection rounds multiply that baseline by the number of rounds, and they do so on a context that is growing with each round. A three-round reflection chain inside a ten-step agent loop can easily spend an order of magnitude more tokens than the bare plan would have — to produce the plan it would have produced bare. If your finance team is asking where the inference bill is going and your planner traces are full of reflection blocks, there is a good chance you have found the answer.
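
A back-of-the-envelope model of that compounding, assuming every step and every reflection pass re-reads the full history and emits a fixed number of new tokens; the constants are illustrative, not a pricing model:

```python
def agent_loop_tokens(steps, tokens_per_step, reflection_rounds=0):
    """Rough token estimate for an agent loop that re-sends full history.

    Each pass reads everything produced so far and emits tokens_per_step
    more; each reflection round does the same again on the grown context.
    Illustrative only -- real prompts add system and tool overhead.
    """
    total = 0
    history = 0
    for _ in range(steps):
        for _ in range(1 + reflection_rounds):   # draft plus each reflection pass
            total += history + tokens_per_step   # read the history, write the step
            history += tokens_per_step
    return total

bare = agent_loop_tokens(steps=10, tokens_per_step=500)
reflected = agent_loop_tokens(steps=10, tokens_per_step=500, reflection_rounds=3)
print(bare, reflected, reflected / bare)   # roughly 15x here, not 4x: the growth is quadratic
```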

The stuck planner is telling you something, and the signal is not "reflect more." It is "this model, on this prompt, cannot diversify." The fix is not more of the thing that failed.

What actually produces divergent plans

Four techniques, drawn from recent work on diverse planning, consistently outperform single-model reflection chains for producing genuinely alternative plans. None of them is exotic, but all of them require that you stop asking one model to disagree with itself.

N-best planning with explicit divergence constraints. Instead of generating one plan and asking the model to reconsider, generate K plans in a single pass with an instruction that forces structural difference. "Produce three plans that differ in their top-level decomposition, not just their wording" is a very different request from "produce a plan, then reflect on it." The explicit divergence constraint makes the model sample from a wider slice of its distribution in the first place, rather than collapsing to the mode and trying to wiggle out of it afterward. Verbalized sampling — asking the model to enumerate candidates with their relative likelihoods — reliably produces two to three times the semantic diversity of single-sample prompting and does so without any training changes.
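
A sketch of what that single-pass request can look like, assuming a generic call_llm(system, user, temperature) wrapper around whichever provider you use; the wrapper, the JSON output contract, and the dedup heuristic are all assumptions for illustration:

```python
import json

DIVERGENT_PLANS_PROMPT = """Produce {k} candidate plans for the task below.
The plans must differ in their top-level decomposition -- different major
steps or a different ordering -- not merely in wording. For each plan,
estimate the relative likelihood that it is the best of the {k}.

Return a JSON list: [{{"plan": ["step 1", "step 2", ...], "likelihood": 0.3}}, ...]

Task:
{task}
"""

def n_best_plans(task, call_llm, k=3):
    """One pass, k structurally distinct candidates (verbalized sampling).

    call_llm is an assumed wrapper: (system, user, temperature) -> str.
    """
    raw = call_llm(
        system="You are a planner.",
        user=DIVERGENT_PLANS_PROMPT.format(k=k, task=task),
        temperature=0.8,
    )
    candidates = json.loads(raw)
    # Drop exact duplicates; a real check would compare tool-call
    # sequences, not lowercased step strings.
    seen, distinct = set(), []
    for cand in candidates:
        signature = tuple(s.strip().lower() for s in cand["plan"])
        if signature not in seen:
            seen.add(signature)
            distinct.append(cand)
    return distinct
```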

Planner-and-critic separation with different system prompts. If you want genuine critique, the critic cannot share the planner's system prompt. Give the critic a prompt that narrows its role — adversarial examiner, red-team reviewer, constraint checker — and a context that excludes the planner's self-justifying rationale. The empirical result across Planner-Actor-Critic architectures is that a narrow, specialized critic, even one running on a smaller backbone, produces substantively different feedback than a critic sharing the planner's full context, because it is not primed to validate what the planner already said.
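
One way to wire that separation, with the same assumed call_llm wrapper and a hypothetical extract_steps helper; the load-bearing detail is that the critic gets the task and the bare steps, never the planner's rationale:

```python
PLANNER_SYSTEM = "You are a planner. Draft a step-by-step plan with a short rationale."

CRITIC_SYSTEM = (
    "You are an adversarial reviewer. You did not write this plan. "
    "Identify the step most likely to fail, any missing step, and any "
    "constraint the plan violates. Do not restate or praise the plan."
)

def plan_then_critique(task, call_llm, extract_steps):
    """call_llm: assumed (system, user, temperature) -> str wrapper.
    extract_steps: hypothetical helper that keeps the steps, drops the rationale."""
    plan = call_llm(system=PLANNER_SYSTEM, user=task, temperature=0.7)

    # The critic never sees the planner's self-justification, so it is
    # not primed to validate what the planner already said.
    steps_only = extract_steps(plan)

    critique = call_llm(
        system=CRITIC_SYSTEM,
        user=f"Task:\n{task}\n\nPlan steps:\n{steps_only}",
        temperature=0.3,
    )
    return plan, critique
```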

Temperature-varied resampling. If you are going to generate multiple plans, do not draw them all at the same temperature. Draw some near the mode for a conservative baseline, draw some at higher temperature for exploratory alternatives, and then let a separate scoring pass rank them. This is cheaper than reflection rounds and produces more structural diversity per token spent, because each high-temperature sample is an independent draw rather than a conditioned revision of the previous draw.
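
A sketch of the resampling pass, with score_plan standing in for whatever separate scoring you run (a critic call, a heuristic, a verifier); both helpers are assumptions:

```python
def temperature_sweep(task, call_llm, score_plan, temps=(0.2, 0.7, 0.7, 1.0, 1.0)):
    """Independent draws at mixed temperatures, ranked by a separate pass.

    Each sample conditions only on the task, never on an earlier draft,
    so the high-temperature draws explore instead of revising.
    call_llm and score_plan are assumed wrappers.
    """
    candidates = [
        call_llm(system="You are a planner. Output only the plan steps.",
                 user=task, temperature=t)
        for t in temps
    ]
    ranked = sorted(candidates, key=score_plan, reverse=True)
    return ranked[0], ranked
```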

Cross-model planner ensembles. The cheapest way to break confirmation bias is to use a different model. Two planners from different families, drafting independently, will disagree structurally far more often than one planner reflecting on itself three times. The cost comparison is often favorable — a second model's full plan is frequently cheaper than a third reflection round on an already-inflated context — and the diversity is categorical, not cosmetic. Ensembling across families is the operational expression of "this model has a prior; route around it."
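
And the cross-family version, assuming two provider wrappers with the same shape; the comparison is deliberately thin, because the point is where the plans come from, not how they are scored:

```python
def ensemble_plans(task, call_model_a, call_model_b):
    """Independent drafts from two model families.

    call_model_a / call_model_b: assumed wrappers for two different
    providers, each (system, user, temperature) -> plan text.
    """
    system = "You are a planner. List the plan as numbered steps, one tool call per step."
    plan_a = call_model_a(system=system, user=task, temperature=0.7)
    plan_b = call_model_b(system=system, user=task, temperature=0.7)

    # If the families agree on the decomposition, no amount of reflection
    # was going to move it; if they disagree, you have a genuine
    # alternative to adjudicate rather than a reworded copy.
    return plan_a, plan_b
```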

The uncomfortable finding for teams that shipped with reflection

If you shipped reflection as a quality feature and you measure the three metrics above, you will likely find that most of your reflection spend produced cosmetic output and that the model was, in effect, charging you to reassure itself. That is a hard finding. It implicates a product claim, a prompt-engineering playbook, and in some cases a performance number on an internal benchmark that was never re-run after the reflection feature was added.

The honest response is to re-run the benchmark with and without reflection, bucketed by problem difficulty. The common pattern: reflection helps a little on problems the model would have gotten right anyway (noise in, noise out, final answer unchanged), does nothing on problems at the frontier of the model's capability, and sometimes hurts on problems where the first plan was correct and the reflection talked the model into a plausible-sounding wrong revision.
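
A small sketch of that bucketed comparison, assuming you have per-problem outcomes for both arms and some difficulty label; all field names are placeholders for your benchmark format:

```python
from collections import defaultdict

def bucketed_ab(results):
    """results: iterable of (difficulty, correct_without, correct_with) per
    problem, with correctness recorded as 0 or 1."""
    buckets = defaultdict(lambda: {"n": 0, "without": 0, "with": 0})
    for difficulty, correct_without, correct_with in results:
        b = buckets[difficulty]
        b["n"] += 1
        b["without"] += correct_without
        b["with"] += correct_with
    # Per-bucket accuracy for both arms, so "helps on easy, flat at the
    # frontier, hurts where the first plan was right" stays visible
    # instead of being averaged away.
    return {
        d: {"without": b["without"] / b["n"], "with": b["with"] / b["n"]}
        for d, b in buckets.items()
    }
```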

The fix is not to remove reflection entirely. Single-model reflection has real uses — format correction, explicit invariant checks, constraint-violation catches where the invariant is mechanically verifiable and does not rely on the model's judgment to surface. Those uses look nothing like "reconsider the plan." They look like "verify that every tool call in the plan targets a tool that exists" or "confirm the plan terminates." The discipline is to stop marketing structural replanning as a capability that single-model reflection does not have, and to route the budget toward the techniques that actually produce divergent plans.
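
Those mechanically verifiable checks can live entirely outside the model; a sketch, assuming plans are serialized as tool-call steps and you hold a registry of tools the runtime actually exposes:

```python
def plan_lint(plan_steps, known_tools, max_steps=50):
    """Mechanical checks that need no model judgment.

    plan_steps: ordered list of dicts with a 'tool' field (assumed schema).
    known_tools: set of tool names the runtime actually exposes.
    Returns a list of violations; empty means the plan passes the lint.
    """
    violations = []
    for i, step in enumerate(plan_steps):
        if step["tool"] not in known_tools:
            violations.append(f"step {i}: unknown tool '{step['tool']}'")
    if len(plan_steps) > max_steps:
        violations.append(f"{len(plan_steps)} steps exceeds the cap; does the plan terminate?")
    return violations
```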

What to do Monday

If you run a planner with reflection in production, here is the concrete next move. Sample a hundred reflection rounds from last week's traces. For each one, diff the pre-reflection plan against the post-reflection plan at the level of the tool-call sequence, not the prose. Bucket the diffs into three piles: structural change, lexical-only change, and no change. If the structural-change bucket is smaller than 20%, the reflection feature is a placebo and the budget it consumes should be redirected. If the structural-change bucket is healthy but the outcomes are no better, the reflection is moving the plan but not in a useful direction, and the critic prompt needs separation from the planner. Either way, you now have a number. "With reflection" as a marketing phrase is load-bearing only when someone is willing to count.
