
The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

Tian Pan · Software Engineer · 10 min read

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens the retries burn. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.

The Reroll Looks Like Variance, but Acts Like Survivorship Bias

There is a legitimate version of the same physical action. A deliberate N-of-K sample — say, ten runs at temperature 0.3 with a written hypothesis like "I expect the citation step to fail in roughly 30% of cases" — is a real diagnostic technique. It tests whether a failure mode is rare or routine, and it gives you a denominator. Researchers have formalized this under "pass@k" and have shown that variance reduction across resampling is a load-bearing primitive for any honest LLM evaluation. The Good, the Bad, and the Greedy paper is now a standard reference for why ignoring non-determinism in evaluation produces misleading rankings.
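The difference between a probe and a reroll is that the probe writes the hypothesis and the decision rule down before the first run. A minimal sketch, assuming a caller supplies `run_once` (a hypothetical stand-in for "invoke the model on the fixed input and check whether the citation step succeeded"):

```python
def run_variance_probe(run_once, k: int = 10,
                       hypothesized_failure_rate: float = 0.30,
                       tolerance: float = 0.15):
    """Run the SAME input k times and compare the observed failure
    rate against a hypothesis written down before run one.

    run_once: callable returning True on success, False on failure.
    Returns (observed_failure_rate, hypothesis_consistent).
    """
    failures = sum(1 for _ in range(k) if not run_once())
    observed = failures / k
    # Decision rule fixed in advance, not chosen after seeing results.
    consistent = abs(observed - hypothesized_failure_rate) <= tolerance
    return observed, consistent
```

The point is not the arithmetic; it is that `k`, the hypothesized rate, and the tolerance all exist before the lever is pulled, so the run count and the denominator survive into the bug report.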

The despair rerun looks identical from the outside but carries no hypothesis, no documented K, and no decision rule for what counts as success. It is structurally a survivorship-bias machine. Out of ten silent reruns, the engineer remembers the one that worked, ships nothing, and learns nothing. The failure modes that show up only on inputs of a certain shape — long contexts, specific tool outputs, particular punctuation in the user query — are precisely the ones that reroll cannot surface, because nothing in the loop is varying the input deliberately.

A useful test: if you cannot answer "what would make me believe this bug is real and not flakiness, and how many runs I would need to find out," you are not running a variance probe. You are pulling the lever.

Why Reruns Mask Real Bugs

The model genuinely is non-deterministic, and that is the cover story the antipattern hides behind. But the non-determinism budget is much smaller than most engineers assume, and the bugs it disguises are usually not stochastic at all.

Even at temperature 0, identical inputs do not guarantee identical outputs. Floating-point non-associativity in GPU reductions, dynamic batch sizes that change kernel scheduling, mixture-of-experts routing whose decisions depend on what other sequences are in the batch — these all leak entropy into "deterministic" inference. Recent work from Thinking Machines traces the problem to batch invariance specifically, and the Eval4NLP 2025 paper on hosted-model determinism quantifies accuracy swings of up to 15 percentage points across runs of supposedly identical configurations. So yes, the model can return different tokens for the same input.

But that is the noise floor. It is not what is taking down your feature. The bugs that the rerun antipattern most reliably masks are deterministic-but-conditional: a tool that returns stale data because a cache TTL is wrong, a retrieval index that has a hole on a specific date range, a system prompt whose instructions contradict each other on a narrow class of user intent, an agent loop that compounds a small upstream error into a large downstream one only when the chain is more than three hops deep. Each of these will fail every time the input matches the trigger and pass every time the input does not. Reroll does not change the input. Reroll therefore cannot distinguish the two cases. It can only tell you that the model sometimes lands on a path that doesn't trigger the bug, which is information the user already had when they filed the ticket.
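A toy illustration of why reroll is blind here. The function below is a hypothetical stand-in for a pipeline step; the trigger token and the "retrieval hole" framing are invented for the example, but the shape is the point: the failure is a deterministic function of the input, not of the sampling seed.

```python
def pipeline_step_ok(user_query: str) -> bool:
    """Toy stand-in for a deterministic-but-conditional bug:
    fails every time the input matches the trigger, passes otherwise,
    independent of temperature, seed, or how many times you rerun."""
    TRIGGER = "Q3-2023"  # hypothetical: a date range the retrieval index has a hole in
    return TRIGGER not in user_query

# Ten reruns on a non-triggering input: all pass. Ten reruns on the
# triggering input: all fail. Reroll never varies the input, so it
# cannot tell this case apart from rare stochastic failure.
clean = all(pipeline_step_ok("summarize Q1-2024 revenue") for _ in range(10))
buggy = any(pipeline_step_ok("summarize Q3-2023 revenue") for _ in range(10))
```

The only probe that distinguishes the two cases is one that perturbs the input while holding everything else fixed, which is the opposite of what a reroll does.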

The engineers who fall hardest into this trap are the ones with the strongest backend-debugging instincts, because the muscle memory from deterministic systems — "if it's flaky, hammer it until it stabilizes, then look at the diff" — produces exactly the wrong behavior when the substrate is non-deterministic but the bug isn't.

Trace First, Then Reproduce

The opposite stance is to treat a single failed run as a fully analyzable artifact. One trace, end to end, with every prompt, every tool call, every retrieval result, every intermediate state, captured before you decide whether the bug is worth reproducing. This is the move that distinguishes teams who debug AI systems quickly from teams who flail.

A trace lets you ask the questions reroll cannot answer. Did the model see the wrong context? Was the tool input malformed in a way the agent silently absorbed? Did a retrieval call return zero documents, or the wrong documents? Was a tool call retried five times because the first response failed schema validation, burning tokens and corrupting the conversation history along the way? You cannot answer any of these from output text alone. You can answer all of them from a structured trace.
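What "structured trace" means in practice can be sketched minimally. The class names and event vocabulary below are assumptions, not any particular tracing library's API; the idea is only that every step the run took becomes a queryable record:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceEvent:
    step: str                 # e.g. "prompt", "tool_call", "retrieval", "model_output"
    payload: dict[str, Any]   # the full inputs/outputs of that step

@dataclass
class Trace:
    """One failed run, captured end to end as an analyzable artifact."""
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, step: str, **payload) -> None:
        self.events.append(TraceEvent(step, payload))

    def find(self, step: str) -> list[TraceEvent]:
        return [e for e in self.events if e.step == step]

# Each question reroll cannot answer becomes a query over the trace:
#   trace.find("retrieval")  -> did a call return zero documents, or the wrong ones?
#   trace.find("tool_call")  -> was the input malformed? retried five times?
```

The design choice that matters is capturing payloads at the boundary of every step, not just the final output text; once the events exist, the diagnostic questions in the paragraph above become one-line lookups instead of reruns.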
