
The Hot-Reload Loop Your Coding Agent Polluted

12 min read
Tian Pan
Software Engineer

A coding agent and a hot-module-replacement dev server are both, individually, magic. Put them in the same working directory and they become a producer-consumer pair with no synchronization primitive between them. The agent writes a file. The watcher fires. The dev server reloads to a state that exists for ninety milliseconds before the agent's next write replaces it. The error overlay reflects a snapshot the file system already moved past. The agent reads that overlay, treats it as ground truth, and writes a fix for a problem the next save will erase anyway.

You don't notice this on a one-line edit. You notice it when the agent is doing a coordinated multi-file change (renaming a prop across a component tree, threading a new field through a hook, splitting a module) and every intermediate state between "start" and "done" is, by construction, broken. The watcher does not know the difference between an intermediate state and a final one. The agent, observing the watcher's output, cannot tell the difference between a real error and an artifact of its own in-flight work.

What the team built was a tight feedback loop for a human developer who saves once, watches the reload, thinks, and saves again. What got wired in was a participant that writes faster than the loop's debounce was tuned for and reads the loop's output back into its own reasoning. That is not a feedback loop. That is a self-exciting oscillator with a token meter attached.

The Watcher Was Already Lying Before You Added an Agent

File watchers have always been an approximation. The kernel reports raw inode events; the watcher tries to translate those into "the developer saved the file." Editors do not cooperate. VS Code's atomic save produces a CREATE for a temp file, a RENAME to swap it into place, and often a follow-up WRITE — three events for one save. Vim with backupcopy=auto can emit five. JetBrains IDEs have their own safe-write sequence. The fsnotify community has been arguing about this for a decade.

The standard mitigation is debounce: collect events, wait for quiet, then emit one. TypeScript's tsc --watch debounces 250 ms via scheduleProgramUpdate. Common guidance for editor reload is 100–500 ms, which works fine for humans whose two consecutive saves are at least a few seconds apart. Vite ships single-file HMR updates in under fifty milliseconds in the best case and falls back to a full page reload when the change crosses an unhandled boundary or hits a circular dependency.

These numbers were calibrated for a participant whose thinking time dominates their typing time. A coding agent inverts that ratio. It can finish a tool call, observe the result, plan the next edit, and dispatch a new write in well under the watcher's debounce window. The watcher's quiet-period heuristic, which exists precisely to avoid firing on intermediate states, gets defeated by a producer that never goes quiet. The agent's edits appear, from the watcher's perspective, as one continuous tremor — and the dev server reloads against the wrong snapshot of it.
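A toy model makes the pacing problem concrete. The sketch below simulates a trailing-edge debounce over a list of write timestamps; the function names and all timings are hypothetical, and it is not how any particular watcher is implemented. It shows two things: a tight burst starves the watcher until the agent stops, and writes paced just past the window make the watcher fire on every partial state.

```python
from typing import List


def debounce_fire_times(event_times_ms: List[int], quiet_ms: int) -> List[int]:
    """Times at which a trailing-edge debounce fires.

    Every event restarts the timer; the callback runs only once
    `quiet_ms` have passed with no further event.
    """
    events = sorted(event_times_ms)
    fires: List[int] = []
    for index, current in enumerate(events):
        is_last = index == len(events) - 1
        if is_last or events[index + 1] - current >= quiet_ms:
            fires.append(current + quiet_ms)
    return fires


def fires_before_last_write(event_times_ms: List[int], quiet_ms: int) -> List[int]:
    """Fires that happen while later writes of the same change set are pending."""
    last_write = max(event_times_ms)
    fires = debounce_fire_times(event_times_ms, quiet_ms)
    return [t for t in fires if t < last_write]


# Hypothetical timings in milliseconds.
human_saves = [0, 4000, 9000]      # one save, then seconds of thinking
agent_burst = [0, 100, 200, 300]   # four writes, gaps under the window
agent_paced = [0, 300, 600, 900]   # the same four writes, gaps just over it

human_fires = debounce_fire_times(human_saves, quiet_ms=250)
burst_fires = debounce_fire_times(agent_burst, quiet_ms=250)
paced_partial_fires = fires_before_last_write(agent_paced, quiet_ms=250)
```

With a 250 ms window, the human gets one reload per save, the burst gets a single reload after the last write, and the paced run gets three reloads against states the agent never meant anyone to see. No window size fixes both agent cases at once, which is why tuning the debounce alone does not solve the problem.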

Three Failure Modes That Look Like the Agent Got Stupid

This pattern produces specific, recognizable behavior in the agent's transcript. None of them announce themselves as "we have a watcher race." They look like quality regressions.

The first is the phantom error spiral. The agent edits file A, planning to edit file B in the same logical change. After the first write, the watcher fires, the type checker re-runs against a half-applied change set, and the error overlay reports that A now references a symbol B doesn't yet expose. The agent observes the error and reasons about it: maybe the import is wrong, maybe the symbol should be named differently, maybe A's signature needs adjusting. It writes a "fix" for the phantom. By the time the fix lands, the original plan is forgotten, B has never been touched, and the diff that ships is a defensive patch around a problem that would have disappeared on its own.

The second is transient error oscillation. The agent makes a change. The error overlay clears. The agent makes a second change. The watcher fires before the second write completes (you can hit this with multi-megabyte file rewrites, or with sequences of small files that the agent dispatches in parallel). A momentary syntax error appears in the overlay — the kind that exists for one frame of the dev server's life — and the agent treats it as live state. The next edit walks back the second change. The error clears. The agent tries again. The transcript shows the agent making and reverting the same edit three times across thirty thousand tokens.

The third is lint-on-save amplification. ESLint with TypeScript plugins, particularly type-aware rules, is computationally expensive; on a medium-sized project it can add on the order of ten seconds per run. If the agent is editing through a watcher pipeline where lint-on-save fires after every write, the agent's read of the project's "current errors" lags the file system by the lint pipeline's latency. The agent will see, in turn N+2, errors that reflect the state of turn N, and reason about them as if they were current. Every reasoning step is grounded in a snapshot of the world from two edits ago.

In all three, the agent isn't broken. The harness wired the agent's senses to a signal that was never specified for this consumer.

The v0 Loop Was the Public Demo

This is not theoretical. Vercel's v0 product had a documented incident where the agent fell into an infinite read-edit cycle driven by a file-modification race condition and could not be stopped from the user side until the session was killed. The session burned tens of thousands of credits before anyone noticed it had gone reflexive. The proximate cause was that two participants — the agent and the platform's file-syncing layer — were both modifying the same files on overlapping timescales, and the agent's read of the result kept producing a delta it then tried to "fix."

The same pattern shows up wherever coding-agent harnesses run alongside live dev tooling. Watchman, Facebook's file watcher, is documented as incompatible with Claude Code's sandbox precisely because the two ways of observing the file system don't compose. Webpack and Next.js have a long list of HMR issues that surface when the file system changes faster than the bundler's snapshot model can keep up — issues a human developer hits occasionally and a writing-as-fast-as-it-can agent hits as a steady state.

The token-cost frame matters. Reports of agents calling the same tool seventy-three times in a session with slightly different arguments are not rare; one documented v0 incident exceeded forty-seven thousand tokens before the user managed to terminate. Loop-detection research has settled on a coarse heuristic: hash the agent's tool calls across consecutive turns, and treat two identical calls as a strong signal of a stuck loop. The watcher-driven oscillation defeats that heuristic because each successive "fix" is slightly different from the last — the agent is making real edits, just the wrong ones, against a moving target.
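To see why the hash heuristic misses this case, here is a minimal sketch of it. The tool names, argument shapes, and function names are invented for illustration and do not correspond to any specific harness.

```python
import hashlib
import json
from typing import Any, Dict, List


def call_fingerprint(tool: str, args: Dict[str, Any]) -> str:
    """Stable hash of a tool call; key order in `args` does not matter."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_identical_consecutive_calls(calls: List[Dict[str, Any]]) -> bool:
    """True when any two back-to-back calls hash to the same value."""
    prints = [call_fingerprint(c["tool"], c["args"]) for c in calls]
    return any(a == b for a, b in zip(prints, prints[1:]))


# A classic stuck loop: the same call twice in a row.
stuck = [
    {"tool": "read_file", "args": {"path": "src/Button.tsx"}},
    {"tool": "read_file", "args": {"path": "src/Button.tsx"}},
]

# A watcher-driven oscillation: every "fix" differs a little from the last.
oscillating = [
    {"tool": "edit_file", "args": {"path": "src/Button.tsx", "new": "label: string"}},
    {"tool": "edit_file", "args": {"path": "src/Button.tsx", "new": "label?: string"}},
    {"tool": "edit_file", "args": {"path": "src/Button.tsx", "new": "label: string | undefined"}},
]

stuck_detected = has_identical_consecutive_calls(stuck)
oscillation_detected = has_identical_consecutive_calls(oscillating)
```

The first trace trips the detector and the second does not, even though the second is the more expensive failure. Catching it would need a signal about the target (the same file, the same region, edited repeatedly) rather than about the exact call.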

Patterns That Make Edit and Observe Stop Fighting Each Other

The fix is not to make the watcher faster or the agent slower. It is to introduce a synchronization primitive that was not in the original ergonomics design.

Transactional edit windows. Treat a related set of file changes as a single transaction the harness opens, completes, and then announces to the watcher. During the window, the dev server's reaction is paused or its output is gated. The agent commits a coherent change to disk as one event, not as a stream of partials. This is the same idea as a database transaction — multiple writes appear atomic to any observer — applied to the file system the watcher cares about. The implementation can be as cheap as a marker file the watcher's debounce logic respects, or as serious as a FUSE overlay that buffers writes until commit.
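The cheap end of that range might look like the sketch below. The marker file name and both helper functions are assumptions made for illustration; a real watcher would need its own hook that calls something like should_react before reloading, and would need to ignore the marker file itself.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

MARKER_NAME = ".agent-edit-in-progress"  # hypothetical marker file name


@contextmanager
def edit_transaction(root: Path) -> Iterator[None]:
    """Hold a marker file for the duration of a multi-file change."""
    marker = root / MARKER_NAME
    marker.touch()
    try:
        yield
    finally:
        # Removing the marker is the "commit" the watcher waits for.
        marker.unlink(missing_ok=True)


def should_react(root: Path) -> bool:
    """What a cooperating watcher would check before reloading."""
    return not (root / MARKER_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    with edit_transaction(root):
        (root / "a.ts").write_text("export const a = 1;\n")
        reacted_mid_edit = should_react(root)
        (root / "b.ts").write_text("import { a } from './a';\n")
    reacted_after_commit = should_react(root)
```

This version does not handle nested transactions or a harness that crashes while holding the marker; a production version would probably want a timeout or an owner PID written into the file.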

Quiet mode from the harness. Have the harness expose a "the agent is mid-edit" signal that the dev server, type checker, and linter can subscribe to. While quiet mode is on, watchers pause their reactions; when it clears, they reconcile against the final state in one pass. Modern dev servers expose enough of their internals to make this wirable: Vite's import.meta.hot.invalidate() and full-reload trigger are scriptable, and a type checker or linter running in watch mode can usually be wrapped in a small supervisor that holds its runs until the signal clears. The cost is a small amount of configuration and a discipline of bracketing multi-file work.
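As a rough sketch of the subscription side, the following models quiet mode as an in-process signal. QuietMode and GatedChecker are hypothetical names, the checker is a stand-in for a real type checker or linter, and a real setup would carry the signal over IPC rather than a shared object.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Set


class QuietMode:
    """Signal the harness raises while the agent is mid-edit."""

    def __init__(self) -> None:
        self._depth = 0
        self._listeners: List[Callable[[bool], None]] = []

    @property
    def active(self) -> bool:
        return self._depth > 0

    def subscribe(self, listener: Callable[[bool], None]) -> None:
        self._listeners.append(listener)

    @contextmanager
    def hold(self) -> Iterator[None]:
        self._depth += 1
        if self._depth == 1:
            self._notify(True)
        try:
            yield
        finally:
            self._depth -= 1
            if self._depth == 0:
                self._notify(False)

    def _notify(self, is_quiet: bool) -> None:
        for listener in self._listeners:
            listener(is_quiet)


class GatedChecker:
    """Stand-in for a watch-mode tool that respects quiet mode."""

    def __init__(self, quiet: QuietMode) -> None:
        self._quiet = quiet
        self._dirty: Set[str] = set()
        self.runs: List[List[str]] = []  # one entry per reconcile pass
        quiet.subscribe(self._on_quiet_change)

    def on_file_changed(self, path: str) -> None:
        self._dirty.add(path)
        if not self._quiet.active:
            self._run()

    def _on_quiet_change(self, is_quiet: bool) -> None:
        # When quiet mode clears, reconcile everything in a single pass.
        if not is_quiet and self._dirty:
            self._run()

    def _run(self) -> None:
        self.runs.append(sorted(self._dirty))
        self._dirty.clear()


quiet = QuietMode()
checker = GatedChecker(quiet)

with quiet.hold():
    checker.on_file_changed("Button.tsx")
    checker.on_file_changed("useButton.ts")
    runs_while_quiet = len(checker.runs)

checker.on_file_changed("App.tsx")  # outside the window: reacts immediately
```

The depth counter lets nested holds behave sensibly: the signal clears only when the outermost bracket exits.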

Observation gates. Place a small filter between the dev server's output and the agent's context. Tag every error overlay, type-check result, and lint diagnostic with the file-system timestamp it was computed against. Compare that timestamp to the agent's most recent write. If the diagnostic was computed against a state earlier than the latest write, it is stale by construction, so drop it before it enters the agent's context. The agent sees only diagnostics whose state is consistent with the state the agent itself just produced. The harness becomes responsible for the consistency contract the agent's senses depend on.
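The filter itself can be very small. The sketch below assumes each diagnostic already carries the timestamp of the snapshot it was computed against; the Diagnostic shape and the gate function are illustrative, not an existing API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Diagnostic:
    source: str           # e.g. "tsc", "eslint", "overlay"
    message: str
    snapshot_time: float  # file-system time the tool computed against


def gate(diagnostics: List[Diagnostic], latest_write_time: float) -> List[Diagnostic]:
    """Keep only diagnostics computed at or after the agent's latest write."""
    return [d for d in diagnostics if d.snapshot_time >= latest_write_time]


latest_write = 105.0
incoming = [
    Diagnostic("tsc", "Property 'title' does not exist on type 'Props'", 101.2),
    Diagnostic("eslint", "'label' is defined but never used", 103.9),
    Diagnostic("tsc", "Type 'number' is not assignable to type 'string'", 105.4),
]

visible_to_agent = gate(incoming, latest_write)
```

Only the last diagnostic survives. Timestamps work when every tool reads the same clock; a monotonically increasing write counter maintained by the harness would be a sturdier key if that cannot be guaranteed.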

Edit-and-diff before save-and-watch. Have the agent stage all related changes in memory, run a self-review against the proposed final diff, and only then dispatch one write per file. The watcher fires once per file at most, and the dev server reloads against a state the agent has already committed to. This is closer to how a human developer with a multi-file change actually works — you don't save in the middle of a rename. The cost is that the agent loses the ability to "feel out" the change incrementally by reading intermediate compiler output. That ability was always more bug than feature in this setting.
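One way to structure the staging step is sketched below. StagedEdit is a hypothetical helper; the file names and contents are made up, and the review step is reduced to producing a unified diff the agent could read before committing.

```python
import difflib
import tempfile
from pathlib import Path
from typing import Dict


class StagedEdit:
    """Collect a multi-file change in memory, then write each file once."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._staged: Dict[str, str] = {}

    def stage(self, rel_path: str, new_text: str) -> None:
        # Restaging the same path replaces the draft; nothing touches disk.
        self._staged[rel_path] = new_text

    def diff(self) -> str:
        """Unified diff of the proposed final state against what is on disk."""
        chunks = []
        for rel_path, new_text in sorted(self._staged.items()):
            target = self.root / rel_path
            old_text = target.read_text() if target.exists() else ""
            chunks.extend(difflib.unified_diff(
                old_text.splitlines(keepends=True),
                new_text.splitlines(keepends=True),
                fromfile=f"a/{rel_path}",
                tofile=f"b/{rel_path}",
            ))
        return "".join(chunks)

    def commit(self) -> int:
        """Write every staged file exactly once; return how many were written."""
        for rel_path, new_text in sorted(self._staged.items()):
            target = self.root / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(new_text)
        written = len(self._staged)
        self._staged.clear()
        return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Button.tsx").write_text("export type Props = { label: string };\n")

    edit = StagedEdit(root)
    edit.stage("Button.tsx", "export type Props = { title: string };\n")
    edit.stage("App.tsx", "<Button title=\"Save\" />\n")
    edit.stage("App.tsx", "<Button title=\"Submit\" />\n")  # revision stays in memory

    untouched_before_commit = not (root / "App.tsx").exists()
    review = edit.diff()
    files_written = edit.commit()
    final_app = (root / "App.tsx").read_text()
```

The agent revised App.tsx twice, but the watcher only ever sees the final version. The commit here is still one write per file rather than one event for the whole set, so pairing it with an edit window or quiet mode closes the remaining gap between the first and last file.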

The benchmark literature on agent file-editing strategies points in the same direction. Atomic file rewrites and script-generated patches are dramatically cheaper and faster than sequential edits — one published comparison clocks script generation at 3.5x cheaper and 6.5x faster than per-line streaming edits — and they have the additional property of crossing the watcher's threshold exactly once per file. The token-economics argument and the synchronization-correctness argument land on the same pattern.

Evaluating That You Actually Closed the Gap

A team that has installed any of the above still needs to verify the agent is no longer reasoning about its own shadow. The evaluation is more interesting than the implementation, because the failure mode is invisible to the agent's own self-report — the agent does not announce that it is in a watcher loop. It just gets quietly worse at the task.

Instrument the dev server to attribute every error overlay, type-check failure, and lint diagnostic to its causing change set, indexed by file-system timestamp. After an agent session, replay the trace and partition the diagnostics into two buckets: ones the agent should have seen (computed against a stable post-edit state) and ones the agent should never have seen (computed during an active edit window). The relevant ratio is what fraction of the agent's reasoning turns were grounded in the second bucket. Track that fraction over a release; it should fall toward zero as the synchronization layer matures.
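The replay step reduces to a partition and a ratio. In the sketch below, the trace format (a reasoning turn number plus the snapshot timestamp of each diagnostic it consumed) and the edit-window representation are assumptions about what the instrumentation would record.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TracedDiagnostic:
    turn: int           # agent reasoning turn that consumed the diagnostic
    computed_at: float  # file-system timestamp of the snapshot it reflects


EditWindow = Tuple[float, float]  # (start, end) of an active edit window


def in_edit_window(timestamp: float, windows: List[EditWindow]) -> bool:
    return any(start <= timestamp < end for start, end in windows)


def shadow_turn_fraction(
    diagnostics: List[TracedDiagnostic],
    windows: List[EditWindow],
    total_turns: int,
) -> float:
    """Fraction of turns that consumed at least one mid-edit diagnostic."""
    if total_turns == 0:
        return 0.0
    shadow_turns = {
        d.turn for d in diagnostics if in_edit_window(d.computed_at, windows)
    }
    return len(shadow_turns) / total_turns


windows = [(10.0, 12.0), (20.0, 21.0)]
trace = [
    TracedDiagnostic(turn=1, computed_at=10.5),  # mid-edit
    TracedDiagnostic(turn=2, computed_at=13.0),  # stable
    TracedDiagnostic(turn=3, computed_at=20.5),  # mid-edit
    TracedDiagnostic(turn=3, computed_at=22.0),  # stable, same turn
    TracedDiagnostic(turn=4, computed_at=30.0),  # stable
]

fraction = shadow_turn_fraction(trace, windows, total_turns=4)
```

Turn 3 counts as a shadow turn even though it also saw a stable diagnostic, on the assumption that one stale input is enough to contaminate the reasoning step.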

For functional eval, run synthetic multi-file edit scenarios that exercise the cross-edit interval — rename a prop across a directory, thread a new field through three layers, split a hook into two — and grade the agent on whether the diff that ships is the diff the agent planned, or a defensive patch around an artifact it observed. The grader should be a deterministic comparison against the intended final state, not a human review; human review will rationalize the defensive patches as "still fine."
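A deterministic grader of this kind can be as plain as a file-by-file comparison. The sketch below represents both states as path-to-content maps; the GradeResult fields and the example contents are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GradeResult:
    missing: List[str] = field(default_factory=list)     # planned, not shipped
    unexpected: List[str] = field(default_factory=list)  # shipped, never planned
    mismatched: List[str] = field(default_factory=list)  # shipped with other content

    @property
    def passed(self) -> bool:
        return not (self.missing or self.unexpected or self.mismatched)


def grade(intended: Dict[str, str], shipped: Dict[str, str]) -> GradeResult:
    """Compare the shipped files against the intended final state."""
    result = GradeResult()
    for path in sorted(set(intended) | set(shipped)):
        if path not in shipped:
            result.missing.append(path)
        elif path not in intended:
            result.unexpected.append(path)
        elif intended[path] != shipped[path]:
            result.mismatched.append(path)
    return result


intended = {
    "Button.tsx": "export type Props = { title: string };\n",
    "App.tsx": "<Button title=\"Save\" />\n",
}

# A defensive patch: the rename half-happened and a shim file appeared.
defensive = {
    "Button.tsx": "export type Props = { title?: string; label?: string };\n",
    "App.tsx": "<Button title=\"Save\" />\n",
    "compat.ts": "export const legacyLabel = (p: any) => p.label ?? p.title;\n",
}

defensive_result = grade(intended, defensive)
clean_result = grade(intended, dict(intended))
```

Exact text comparison is deliberately strict. If formatting noise causes false failures, normalizing both sides through the project's formatter before comparing is a reasonable relaxation; loosening the check beyond that starts to readmit the defensive patches.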

A useful secondary metric is the variance of the agent's plan across replays of the same task. A watcher loop produces high plan variance — the agent reaches different end states from different runs because it reacts to different shadows. As the synchronization improves, the variance collapses, and the agent's outputs become reproducible in the way a deterministic compiler is reproducible.
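"Variance of the plan" needs an operational definition before it can be tracked. One possible proxy, sketched below, is the share of replays whose end state differs from the most common end state. Both the digest scheme and the plan_dispersion name are assumptions; other dispersion measures would serve equally well.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def end_state_digest(files: Dict[str, str]) -> str:
    """Order-independent digest of a replay's final file contents."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def plan_dispersion(replays: List[Dict[str, str]]) -> float:
    """Share of replays that did not land on the most common end state."""
    if not replays:
        return 0.0
    counts = Counter(end_state_digest(files) for files in replays)
    modal_count = counts.most_common(1)[0][1]
    return 1.0 - modal_count / len(replays)


planned = {"Button.tsx": "export type Props = { title: string };\n"}
shadowed = {"Button.tsx": "export type Props = { title?: string };\n"}

stable_runs = [planned, planned, planned, planned]
noisy_runs = [planned, planned, shadowed, planned]

stable_dispersion = plan_dispersion(stable_runs)
noisy_dispersion = plan_dispersion(noisy_runs)
```

A value of zero means every replay converged on the same files. Some nonzero dispersion is expected from sampling alone, so the useful comparison is the same task with and without the synchronization layer, not the absolute number.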

The Architectural Shift Most Teams Skip

The instinct on first encountering this is to debug the agent: better prompts, smarter loop detection, a longer think step. None of that addresses the root cause. The root cause is that "tight feedback loop" was a developer-ergonomics property of a single-participant system, and the team turned the system into a multi-participant one without renegotiating the protocol.

A dev server's watcher contract is, implicitly: one human saves files at human speed; reload to reflect what they saved. When the second participant is an agent that writes at machine speed and reads the watcher's reactions into its own reasoning, the contract is silently violated on both sides. The agent's writes outpace the watcher's debounce; the agent's reads consume the watcher's output as if it were a stable observation rather than a noisy heuristic. Neither party knows the contract changed.

The work is to make the contract explicit. Decide which participant is the producer of edits and which is the consumer of state. Decide what counts as a committed change versus in-flight scratch work. Decide what state the agent is allowed to observe and through what filter. The synchronization primitive doesn't have to be heavy (a marker file, an IPC channel, an in-process gate), but it has to exist, and someone on the team has to own it.

The teams that ship coding agents into live dev environments without naming this contract are not shipping coding agents. They are shipping producer-consumer races whose cost shows up as wasted tokens, defensive diffs, and reviewers wondering why this PR is so much weirder than the description implied. The fix is cheap once you see it. The hard part is realizing the participants in your dev loop are not who they used to be.
