The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam
A forty-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.
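The shape of that failure is worth pinning down in code. Here is a minimal sketch with hypothetical package and symbol names: the definition gains a parameter, the call sites the agent could see are updated, and one consumer outside its context window keeps the old arity.

```typescript
// packages/users/src/service.ts (the file the agent edited)
export interface User {
  id: string;
  tenantId: string;
}

// The signature gained a fourth parameter; the call sites in this package
// were inside the agent's context window and were updated to match.
export async function getUser(
  id: string,
  includeProfile: boolean,
  includeOrders: boolean,
  tenantId: string, // added by the refactor
): Promise<User> {
  return { id, tenantId };
}

// packages/billing/src/invoices.ts (never loaded into the agent's context)
// Still the old three-argument call. Because this line was never edited,
// no hunk for it appears in the PR, so there is nothing for a reviewer
// to reject. A project-wide type check is what flags it:
//   error TS2554: Expected 4 arguments, but got 3.
export async function invoiceOwner(userId: string): Promise<User> {
  return getUser(userId, true, false);
}
```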
This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams: the boundaries where edits from different files have to agree. File-level review hides seam-level failures, and most review workflows were designed around files.
The Agent Sees Hunks, Not Graphs
Human refactoring tools — IntelliJ's Rename, rust-analyzer's change-signature, TypeScript's project-wide references — operate on a resolved symbol graph. When you rename getUser to fetchUser, the tool doesn't search for the string; it walks the AST, finds every reference bound to that specific symbol, and rewrites them. If any site can't be updated mechanically, the tool refuses and shows you a list.
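The same symbol-graph rename can be driven programmatically. A sketch using ts-morph, a wrapper over the TypeScript compiler API, with illustrative file and function names:

```typescript
import { Project } from "ts-morph"; // wraps the TypeScript language service

// Load the whole project, so the rename operates on the resolved symbol
// graph rather than on whichever files happen to be in view.
const project = new Project({ tsConfigFilePath: "tsconfig.json" });

const file = project.getSourceFileOrThrow("src/users/service.ts");
const fn = file.getFunctionOrThrow("getUser"); // illustrative path and name

// rename() finds every reference bound to this specific symbol across the
// project (imports, call sites, re-exports) and rewrites them together.
fn.rename("fetchUser");

project.saveSync();
```

Because the rename is bound to the resolved symbol, an unrelated declaration that happens to share the name is left untouched.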
Coding agents, by default, do none of this. They operate on a sequence of textual edits generated from a prompt and a window of files. The symbol graph exists inside the compiler; the agent sees a flattened, chunked view of source. Research on AST-guided generation explicitly frames this gap: LLMs optimize local likelihood, not global semantics, and thus fail to guarantee layered correctness. Syntactic plausibility per hunk is what the model is good at. Cross-file agreement is what the model cannot see.
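A small illustration of what "cannot see" means in practice, again with hypothetical names: two declarations share a name, and only binding information tells them apart.

```typescript
// Two different symbols that happen to share a name. Binding information,
// which lives in the compiler, is the only thing that distinguishes them.

// The symbol the refactor intends to rename:
export function getUser(id: string): string {
  return `user:${id}`;
}

// An unrelated method that merely shares the name:
export class SessionCache {
  private store = new Map<string, string>();
  getUser(id: string): string | undefined {
    return this.store.get(id);
  }
}

// A textual edit in the spirit of `s/getUser/fetchUser/g` renames both;
// a symbol-graph rename of the function touches only the first. And a
// file that never enters the context window is not edited at all, no
// matter how many references it holds.
```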
The consequence is a specific and repeatable bug class. A signature is changed in the definition file. Two call sites in the same file are updated because they're in context. A third call site in a sibling test file is missed because the relevant file wasn't loaded. A fourth call site lives in a separately-packaged consumer that the agent didn't even know to search. Each individual edit is locally plausible. Only the union is wrong.
Why Review Breaks Here Too
The natural instinct is to push the problem onto review. Reviewers catch bugs humans make; let them catch bugs agents make. This does not work, for reasons rooted in how humans read diffs.
Code review is a hunk-level cognitive task. The reviewer looks at a block of changes, asks "does this edit make sense given the surrounding context shown in the hunk?" and approves or comments. Research on AI-assisted review has converged on hunk-level feedback precisely because it is the granularity at which humans can actually reason. The problem is that seam bugs are invisible at that granularity. "Each hunk looks fine" is not a failure of attention; it is a correct observation that becomes wrong only when you compose the hunks.
A forty-file PR compounds the problem. The reviewer now has to hold the rename's full semantics in working memory across every hunk, cross-referencing definitions to call sites to tests to imports. A skilled reviewer can do this for a ten-file diff. Nobody does it well for forty. The reviewer ends up sampling — reading a few hunks carefully, skimming the rest, and trusting that if the compiler were going to complain it would have complained already. When the compiler isn't in the loop, that trust is misplaced.
The Compiler Is the Cheapest Reviewer
The fastest escape from this bug class is to make the compiler part of the agent's loop, not a post-hoc gate after the agent has already written the PR. The distinction matters: a compile-first loop rejects invalid intermediate states and forces the agent to fix them before producing the next edit. A compile-as-gate workflow catches the same errors later, but only after the agent has produced a full diff it thinks is correct.
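A minimal sketch of what that inner loop can look like, assuming a TypeScript project; the Agent interface and proposeEdit are hypothetical stand-ins for whatever your framework exposes, not any real library's API:

```typescript
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Hypothetical agent interface: Edit and proposeEdit are stand-ins for
// whatever your agent framework exposes, not any real library's API.
interface Edit {
  path: string;
  newText: string;
}
interface Agent {
  // Given the current compiler diagnostics, propose the next edit,
  // or return null when the agent believes the refactor is complete.
  proposeEdit(errors: string): Edit | null;
}

// Run a project-wide type check; return "" on success, diagnostics otherwise.
function typeCheck(): string {
  try {
    execFileSync("npx", ["tsc", "--noEmit", "--pretty", "false"], {
      encoding: "utf8",
    });
    return "";
  } catch (err: any) {
    // tsc exits non-zero on type errors; diagnostics arrive on stdout.
    return err.stdout ?? String(err);
  }
}

// Compile-first loop: every edit is immediately followed by a full type
// check, and the diagnostics feed the next prompt. An invalid intermediate
// state is surfaced now, not after the PR is assembled.
function runRefactor(agent: Agent, maxSteps = 50): boolean {
  let errors = typeCheck();
  for (let step = 0; step < maxSteps; step++) {
    const edit = agent.proposeEdit(errors);
    if (edit === null) return errors === ""; // stopping while red is failure
    writeFileSync(edit.path, edit.newText);
    errors = typeCheck(); // the compiler reviews the seam right away
  }
  return false; // edit budget exhausted with errors outstanding
}
```

The compile-as-gate variant runs the same typeCheck once, after the last edit. The loop variant runs it after every edit, which is what turns a TS2554-class seam error into immediate feedback the agent must resolve before it is allowed to continue.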
Sources
- https://www.augmentcode.com/tools/cursor-ai-limitations-why-multi-file-refactors-fail-in-enterprise
- https://www.augmentcode.com/tools/enterprise-multi-file-refactoring-why-ai-breaks-at-scale
- https://kiro.dev/blog/refactoring-made-right/
- https://www.gocodeo.com/post/how-ai-coding-models-handle-context-switching-and-multi-file-refactoring
- https://arxiv.org/html/2508.11126v1
- https://www.coderabbit.ai/blog/ai-native-universal-linter-ast-grep-llm
- https://www.qodo.ai/blog/best-ai-code-review-tools-2026/
- https://www.qodo.ai/blog/shift-left-code-review/
- https://arxiv.org/html/2508.01473v1
- https://samanvya.dev/blog/claude-code-pr-review-toolkit
- https://devops.com/anthropic-code-review-dispatches-agent-teams-to-catch-the-bugs-that-skim-reads-miss/
