
The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam

9 min read
Tian Pan
Software Engineer

A 40-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.

This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams — the boundaries where edits from different files have to agree. File-level review hides seam-level correctness, and most review workflows were designed around files.

The Agent Sees Hunks, Not Graphs

Human refactoring tools — IntelliJ's Rename, rust-analyzer's change-signature, TypeScript's project-wide references — operate on a resolved symbol graph. When you rename getUser to fetchUser, the tool doesn't search for the string; it walks the AST, finds every reference bound to that specific symbol, and rewrites them. If any site can't be updated mechanically, the tool refuses and shows you a list.

Coding agents, by default, do none of this. They operate on a sequence of textual edits generated from a prompt and a window of files. The symbol graph exists inside the compiler; the agent sees a flattened, chunked view of source. Research on AST-guided generation explicitly frames this gap: LLMs optimize local likelihood, not global semantics, and thus fail to guarantee layered correctness. Syntactic plausibility per hunk is what the model is good at. Cross-file agreement is what the model cannot see.

The consequence is a specific and repeatable bug class. A signature is changed in the definition file. Two call sites in the same file are updated because they're in context. A third call site in a sibling test file is missed because the relevant file wasn't loaded. A fourth call site lives in a separately packaged consumer that the agent didn't even know to search. Each individual edit is locally plausible. Only the union is wrong.
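
Reduced to a minimal TypeScript sketch (every name here is illustrative, not taken from any particular codebase), the shape of the bug is:

```ts
// A minimal sketch of the seam-bug class; names are illustrative.
type User = { id: string };
type Tracer = { span(name: string): void };

// Definition file: the refactor adds a required `tracer` parameter.
export function getUser(
  id: string,
  opts: { includeProfile: boolean },
  tracer: Tracer, // new in this refactor
): User {
  tracer.span("getUser");
  return { id };
}

// A call site the agent updated, because the file was in its context:
declare const tracer: Tracer;
getUser("u1", { includeProfile: true }, tracer); // fine

// A call site in a sibling file the agent never loaded:
getUser("u1", { includeProfile: true });
// tsc: error TS2554: Expected 3 arguments, but got 2.
// Locally plausible on its own; wrong only in combination.
```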

Why Review Breaks Here Too

The natural instinct is to push the problem onto review. Reviewers catch bugs humans make; let them catch bugs agents make. This does not work, for reasons rooted in how humans read diffs.

Code review is a hunk-level cognitive task. The reviewer looks at a block of changes, asks "does this edit make sense given the surrounding context shown in the hunk?" and approves or comments. Research on AI-assisted review has converged on hunk-level feedback precisely because it is the granularity at which humans can actually reason. The problem is that seam bugs are invisible at that granularity. "Each hunk looks fine" is not a failure of attention; it is a correct observation that becomes wrong only when you compose the hunks.

A forty-file PR compounds the problem. The reviewer now has to hold the rename's full semantics in working memory across every hunk, cross-referencing definitions to call sites to tests to imports. A skilled reviewer can do this for a ten-file diff. Nobody does it well for forty. The reviewer ends up sampling — reading a few hunks carefully, skimming the rest, and trusting that if the compiler were going to complain it would have complained already. When the compiler isn't in the loop, that trust is misplaced.

The Compiler Is the Cheapest Reviewer

The fastest escape from this bug class is to make the compiler part of the agent's loop, not a post-hoc gate after the agent has already written the PR. The distinction matters: a compile-first loop rejects invalid intermediate states and forces the agent to fix them before producing the next edit. A compile-as-gate workflow catches the same errors later, but only after the agent has produced a full diff it thinks is correct.

In practice, the compile-first discipline looks like this. After any multi-file edit, the agent runs the type checker (or equivalent static analyzer for the language). If it fails, the failure output is fed back into the next turn. The agent does not get to produce a PR that doesn't typecheck. For TypeScript, this means tsc --noEmit on every edit round. For Rust, cargo check. For Python with a typed codebase, mypy or pyright. Recent writeups of agent-operated CI pipelines describe exactly this layering — a strict typecheck pass runs as an inner gate, not a later one.
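
Sketched in TypeScript for concreteness, that inner gate looks like the loop below; `agent.propose` and `applyEdit` are hypothetical stand-ins for whatever framework drives the model, and only the `tsc --noEmit` invocation is the real tool.

```ts
// Compile-first inner loop: the agent cannot exit with a broken tree.
import { execSync } from "node:child_process";

// Hypothetical agent-framework surface; not a real library API.
declare const agent: {
  propose(task: string, compilerFeedback: string | null): Promise<string>;
};
declare function applyEdit(edit: string): void;

// Run the type checker; return null on success, the error text on failure.
function typecheck(): string | null {
  try {
    execSync("npx tsc --noEmit", { stdio: "pipe" });
    return null;
  } catch (err) {
    // tsc exits nonzero on type errors; its output becomes the feedback.
    return (err as { stdout?: Buffer }).stdout?.toString() ?? String(err);
  }
}

async function editLoop(task: string, maxTurns = 10): Promise<void> {
  let feedback: string | null = null;
  for (let turn = 0; turn < maxTurns; turn++) {
    applyEdit(await agent.propose(task, feedback));
    feedback = typecheck();
    if (feedback === null) return; // only a typechecking tree can become a PR
  }
  throw new Error("agent did not converge to a typechecking state");
}
```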

This is not sufficient for untyped languages, and it is not sufficient for semantic bugs in typed ones. A renamed constant that still compiles because both the old and new names are defined will pass typecheck and break at runtime. But compile-first catches the large majority of seam bugs — missed call sites after signature changes, stale imports, deleted symbols still referenced in tests — and catches them at the moment the agent can cheaply correct them.
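
That renamed-constant case, as a hypothetical two-file sketch (both files shown in one block):

```ts
// config.ts, after the agent's "rename":
export const RETRY_LIMIT = 3;  // stale definition the agent never deleted
export const MAX_RETRIES = 5;  // the renamed constant

// worker.ts, a consumer the agent never loaded:
import { RETRY_LIMIT } from "./config"; // still resolves, so tsc stays silent
const attempts = RETRY_LIMIT;           // runs with the stale value: 3, not 5
```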

Use the Refactoring Tool, Not the Text Editor

The second discipline is harder culturally and more important practically: stop treating refactors as text generation. A rename is not a prompt-and-diff task. It is a mechanical transformation that IDE refactoring engines have performed correctly for two decades. Modern tooling exposes these transformations — VSCode's "change signature" API, ast-grep's pattern-based rewrites, language-server rename-symbol — and agents that invoke them get the correctness guarantees for free.

The anti-pattern is an agent that reads a file, notices a symbol, and asks the model to "update all usages." That path produces the seam bug every time. The pattern that works is an agent that recognizes "this is a rename" or "this is a signature change" and dispatches to a deterministic tool that operates on the symbol graph. Kiro's writeup on program-analysis-backed refactoring frames this explicitly: LLMs for intent recognition, program analysis for the transformation. The division of labor is the point.
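
As a sketch of what that dispatch can look like, TypeScript's own language service exposes `findRenameLocations`, which resolves every reference bound to one specific symbol. The call below uses the real compiler API (the exact overload varies across TypeScript versions); constructing the `LanguageService` from a tsconfig is omitted for brevity.

```ts
import * as fs from "node:fs";
import * as ts from "typescript";

function renameSymbol(
  svc: ts.LanguageService,
  fileName: string,
  position: number, // offset of the symbol being renamed
  newName: string,
): void {
  // Resolve every reference bound to this symbol, across the whole project.
  const locs = svc.findRenameLocations(fileName, position, false, false);
  if (!locs) throw new Error("symbol cannot be renamed mechanically");

  // Group edits by file.
  const byFile = new Map<string, ts.RenameLocation[]>();
  for (const loc of locs) {
    const list = byFile.get(loc.fileName) ?? [];
    list.push(loc);
    byFile.set(loc.fileName, list);
  }

  // Apply edits back-to-front so earlier offsets stay valid.
  for (const [file, edits] of byFile) {
    let text = fs.readFileSync(file, "utf8");
    edits.sort((a, b) => b.textSpan.start - a.textSpan.start);
    for (const { textSpan } of edits) {
      text =
        text.slice(0, textSpan.start) +
        newName +
        text.slice(textSpan.start + textSpan.length);
    }
    fs.writeFileSync(file, text);
  }
}
```

Note that when `findRenameLocations` returns nothing, the rename is refused rather than half-applied, which is exactly the guarantee the text-generation path gives up.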

The resistance to this approach usually comes from the observation that language-server refactors don't compose — you can't easily script a chain of renames-plus-moves-plus-signature-changes through the IDE. This is true, and it is why agent frameworks are starting to expose these operations as tools with stable APIs. The direction of travel is clear: the agent chooses the refactoring, the tool performs it, and the model never sees the text-level diff.

Seam-Aware Review: Show Me All the Call Sites

For the cases where text generation is unavoidable — because the codebase is in a language without good tooling, because the refactor is structural enough that no pre-built transformation fits, or because the edit is new code rather than a rename — the review discipline has to shift from hunk-level to seam-level.

Seam-aware review means the reviewer never approves a signature change without seeing every call site in the diff. For a renamed symbol, the reviewer asks for the grep output across the repo (not just the PR) and confirms that every hit is either updated or explicitly out of scope. For a changed return type, the reviewer looks at every consumer. This is the kind of work that a human does badly and a tool does perfectly. The right CI layer runs these consistency checks automatically and surfaces them as diff annotations: "this signature changed; here are the three call sites in the PR and the two call sites outside the PR that weren't touched."
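
One conservative way to build that check is to string-match the old name across the whole repository and classify each hit by whether the PR touched its file. A minimal sketch, assuming the old name is already known; `git grep` over-approximates a true symbol-graph scan, but it never under-reports:

```ts
import { execSync } from "node:child_process";

// All whole-word hits for `pattern` across the repo, as path:line:text.
function gitGrep(pattern: string): string[] {
  try {
    // JSON.stringify is a crude but adequate shell-quote for identifiers.
    const out = execSync(`git grep -n -w -e ${JSON.stringify(pattern)}`, {
      encoding: "utf8",
    });
    return out.split("\n").filter(Boolean);
  } catch {
    return []; // git grep exits nonzero when there are no matches
  }
}

// prFiles: paths the PR touched, e.g. from `git diff --name-only main...HEAD`.
function consistencyScan(oldName: string, prFiles: Set<string>): string[] {
  // An empty result means every occurrence of the old name is gone.
  return gitGrep(oldName).map((hit) => {
    const file = hit.slice(0, hit.indexOf(":"));
    const where = prFiles.has(file) ? "inside the PR" : "OUTSIDE the PR";
    return `stale reference ${where}: ${hit}`;
  });
}
```

Every hit the scan reports outside the PR is precisely the seam a hunk-level reviewer would never see.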

A growing class of review agents now does exactly this — Anthropic's recent code-review agent deployments and the various multi-agent PR-review toolkits all include some version of a "consistency scan" that flags mismatches between a symbol's definition and its references across the whole repository. This is the right shape for the tool: the human reviews intent, the tool verifies invariants, and seam bugs are caught without anyone having to hold the forty-file rename in their head.

The Uncomfortable Migration

The picture that emerges is a migration from one kind of collaboration to a different one. In the old model, the agent writes and the human reviews the diff. In the new model, the agent writes, deterministic tooling verifies global invariants, and the human reviews intent and judgment calls. Humans are the worst available checker for a consistency bug that spans eight files in a forty-file PR. Type checkers, language servers, and cross-reference linters are nearly perfect checkers for the same class of bug, cost almost nothing to run, and never get tired on file thirty-seven.

The teams that have made this shift report a specific and predictable payoff: their agentic refactors merge. The PRs that used to sit open for three days while a human slowly walked through every file now land in an hour because the inner loop caught the seam bugs before the PR existed. The reviewers spend their time on the questions they can actually answer — is this the right refactor, does it match the architectural direction, is there a subtler bug that tooling can't see — and not on the question of whether every call site was updated.

The teams that haven't made this shift are producing a specific pathology: large, locally-plausible, globally-inconsistent PRs that reviewers can't safely approve and agents can't safely revise without starting over. The fix is not better prompts or bigger context windows. The fix is admitting that refactoring is a program-analysis problem, putting program-analysis tools in the agent's loop, and letting the model do the one thing it is actually good at, which is deciding what the refactor should be in the first place.
