AI Coding Agents on Legacy Codebases: Why They Fail Where You Need Them Most
The teams that most urgently need AI coding help are usually not the ones building new greenfield services. They're the ones maintaining 500,000-line Rails monoliths from 2012, COBOL payment systems that have processed billions of transactions, or microservice meshes where the original architects left three acquisitions ago. These are the codebases where a single misplaced refactor can introduce a silent data corruption bug that surfaces three weeks later in production.
And this is exactly where current AI coding agents fail most spectacularly.
The frustrating part is that the failure mode is invisible until it isn't. The agent produces code that compiles, passes existing tests, and looks reasonable in review. The problem surfaces in staging, in the nightly batch job, or in the edge case that only one customer hits on a specific day of the month.
The Greenfield Benchmark Trap
Modern AI coding agents are benchmarked on tasks like HumanEval (implement a function from a docstring) and SWE-bench (fix an isolated GitHub issue). These benchmarks have become nearly saturated — frontier models score above 80% on SWE-bench Verified, and HumanEval is effectively solved.
But a recently published benchmark called SWE-EVO measures something different: long-horizon software evolution on real legacy systems. The best current models resolve only 21% of its tasks, compared to 65% on SWE-bench Verified. That gap — 21% vs. 65% — captures something real about how agents degrade when they leave the controlled, self-contained world of isolated bug reports.
The reason is context. Greenfield code is self-documenting in a mechanical sense: conventions are explicit, the dependency graph fits in a context window, and there's no tribal knowledge to miss. Legacy code is the opposite. The actual constraints on what a function can safely do are scattered across incident postmortems, Slack threads, comments from 2017, and the memory of a developer who left in 2020.
AI agents are trained on code that works. Not code that survives.
What "Plausible-but-Wrong" Looks Like in Practice
A CodeRabbit analysis of 470 open-source pull requests found that AI-authored code produces 1.7× more issues per PR than human-authored code. The breakdown is revealing: logic errors are 75% more common, security vulnerabilities appear at 2.74× the rate, I/O performance problems are 8× more frequent, and concurrency/dependency errors roughly double. These aren't the categories where a quick glance during review catches the problem. They're the categories where the bug lives quietly in production for weeks.
The failure modes cluster into a few recognizable patterns:
Undocumented invariants. Every mature codebase has functions that must only be called under specific conditions — after a particular initialization sequence, never from multiple goroutines, only with non-nil arguments in certain configurations. These invariants aren't in the function signature. They're in the head of whoever wrote it. An agent moving that function to a new module sees clean code and clean type signatures. It doesn't see the invariant.
Implicit dependency ordering. Service A writes to a queue. Service B reads from it. Service A happens to flush before Service B checks. Nobody documented this temporal dependency because it was always true and never broke. An agent refactoring the flush logic changes the timing. Now Service B sees empty queues intermittently, under load, in production.
Test gaps that don't look like test gaps. Legacy code often has 60–80% line coverage that means almost nothing, because the tests were written to cover lines rather than behavior. An agent that passes the test suite has validated very little about whether it preserved the system's actual behavior. Meta discovered that only about 5% of their 4,100+ pipeline modules had accessible context for agents before they built a pre-compute system to document implicit knowledge.
Over-abstraction under the illusion of DRY. When two functions look similar, agents merge them. Often they're similar for accidental reasons — the underlying domain concepts are different, and the similarity will diverge as requirements evolve. The merged version fills up with conditionals. Duplicate code rates in AI-heavy projects jump from 3.1% to 14.2%, and average file sizes nearly double. The code looks "cleaner" but is harder to change.
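To make the first of these patterns concrete, here is a minimal sketch of a comment-only invariant. All names here are invented for illustration; nothing in the type signatures tells an agent (or a reviewer) that `load_rates` must run before `compute_total`.

```python
# Hypothetical illustration: the invariant lives in a comment, not a type.

_rates: dict = {}

def load_rates() -> None:
    # In a real system this would read from a config service; hardcoded here.
    _rates["USD"] = 1.0
    _rates["EUR"] = 1.1

def compute_total(amount: float, currency: str) -> float:
    # INVARIANT: load_rates() must have been called first, or this raises
    # KeyError. The signature gives no hint of that ordering requirement.
    return round(amount * _rates[currency], 2)
```

An agent that moves `compute_total` into a new module or reorders startup code sees clean types and no compiler complaint; the failure only appears at runtime.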
Why the Standard Safeguards Don't Catch These Bugs
The obvious response is: code review. But legacy codebases are exactly where code review breaks down for AI-generated changes. Reviewers depend on familiarity with the context the agent doesn't have — the same tribal knowledge that's absent from the agent's inputs is also absent from the reviewer's mental model of what to check.
Incidents per PR increased 23.5% and change failure rate rose 30% in the twelve months after AI coding assistants became mainstream. Static analysis warnings in agent-heavy repositories grew 18%. Cognitive complexity scores rose 39%. These are not the metrics of a technology that's working as advertised.
The standard linting and type-checking gates don't help because the errors aren't syntactic or type-level. They're semantic — the code does something different than what the system needs, in ways that only emerge when the system runs.
Scaffolding Patterns That Actually Work
The teams making progress on this problem have stopped treating AI agents as autonomous engineers and started treating them as very fast junior engineers who need structured guardrails. The discipline is architectural, not a matter of prompt engineering.
Scope gates on every PR. Cap agent-generated PRs to 20 files and 500 added lines. This isn't about limiting the agent's capability — it's about keeping the blast radius small and keeping human review tractable. Large AI commits obscure serious issues. Break large refactors into reviewable chunks, even if the agent could produce the whole thing at once.
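A scope gate can be a few lines in CI. A sketch, using the limits above and the output of `git diff --numstat` as input (the function name is ours):

```python
MAX_FILES = 20        # caps from the pattern above
MAX_ADDED_LINES = 500

def violates_scope_gate(numstat: str) -> bool:
    """Check `git diff --numstat` output against PR size caps.

    Each numstat line looks like "<added>\t<deleted>\t<path>";
    binary files report "-" for both counts and are skipped.
    """
    rows = [line.split("\t") for line in numstat.strip().splitlines() if line]
    added = sum(int(a) for a, _deleted, _path in rows if a != "-")
    return len(rows) > MAX_FILES or added > MAX_ADDED_LINES
```

Wire this to fail the build, and agents (or their operators) are forced to split large refactors into reviewable chunks.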
Read-only analysis before any generation. Before an agent touches code, require it to produce a dependency map, identify all call sites of the functions it plans to modify, and enumerate the invariants it can discover from comments, tests, and commit history. This phase catches the "I didn't know that" problems before they become code. It also surfaces cases where the agent doesn't have enough context to proceed safely — those cases should stop, not proceed with a guess.
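Part of that read-only phase is mechanical and easy to verify. A sketch of call-site enumeration for Python code using the standard `ast` module (the function name is ours):

```python
import ast

def find_call_sites(source: str, func_name: str) -> list:
    """Return line numbers where func_name is called, including
    method-style calls (obj.func_name(...)). Dynamic dispatch, getattr,
    and aliasing still slip through, which is exactly why ambiguous
    cases should halt the agent rather than proceed on a guess.
    """
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            f = node.func
            if (isinstance(f, ast.Name) and f.id == func_name) or \
               (isinstance(f, ast.Attribute) and f.attr == func_name):
                sites.append(node.lineno)
    return sorted(sites)
```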
Baseline tests before refactoring. The biggest risk from AI refactoring isn't messy code. It's silent behavior change — the function now returns something slightly different under edge conditions, and the existing tests don't cover those conditions. Run characterization tests (tests that document current behavior, not correct behavior) before allowing any agent-driven refactor. If you can't write characterization tests because the code is too tangled, that's a signal the agent shouldn't be touching it yet.
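A characterization test in this style might look like the following. `legacy_round` is an invented stand-in for a real legacy function; the behavior it pins (Python 3's banker's rounding) is real and is exactly the kind of detail a refactor silently changes.

```python
def legacy_round(value):
    # Stand-in for legacy code under refactor. Python 3's round() uses
    # banker's rounding, which downstream callers may silently depend on.
    return round(value)

def test_characterizes_current_behavior():
    # These assertions document what the code DOES today, not what
    # "correct" rounding would be. A refactor that switches to
    # round-half-up passes type checks and breaks this test.
    assert legacy_round(2.5) == 2   # half rounds to even, not up
    assert legacy_round(3.5) == 4
    assert legacy_round(-0.5) == 0
```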
Shadow runs for high-risk changes. For changes that affect shared infrastructure, deploy the new version as a shadow — accepting real traffic, producing real outputs, but not acting on them. Compare shadow outputs to production outputs before cutting over. This is expensive to set up, but it's the only way to validate behavioral preservation for code with insufficient test coverage.
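The wiring for a shadow run can be simple even when the infrastructure around it is not. A sketch (handler names are ours): route each request to both implementations, act only on production's answer, and record divergences.

```python
def handle_with_shadow(request, prod_handler, shadow_handler, mismatches):
    prod_result = prod_handler(request)
    try:
        shadow_result = shadow_handler(request)
        if shadow_result != prod_result:
            mismatches.append((request, prod_result, shadow_result))
    except Exception as exc:
        # A crashing shadow must never affect production traffic;
        # record the failure as a divergence instead.
        mismatches.append((request, prod_result, repr(exc)))
    return prod_result  # only the production output is ever acted on
```

Cut over only once the mismatch log stays empty across a representative traffic window.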
Pre-compute tribal knowledge explicitly. Meta's approach — building a system of 50+ specialized agents to document implicit knowledge across their pipeline codebase — brought context coverage from 5% to 100% of modules. The result was 40% fewer tool calls per task as agents stopped needing to rediscover context at runtime. The investment pays off in agent accuracy, not just in documentation quality.
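At a much smaller scale than Meta's system, the same idea can be sketched as a one-time scan per module. Everything here is illustrative, including the `# INVARIANT:` comment convention, which is our invention rather than any real format:

```python
import ast
import re

def precompute_context(source: str) -> dict:
    """One-time extraction of context an agent would otherwise have to
    rediscover at runtime, tool call by tool call."""
    tree = ast.parse(source)
    return {
        "docstring": ast.get_docstring(tree),
        "public_functions": sorted(
            n.name for n in ast.walk(tree)
            if isinstance(n, ast.FunctionDef) and not n.name.startswith("_")
        ),
        # Illustrative convention: invariants flagged in comments.
        "invariants": re.findall(r"#\s*INVARIANT:\s*(.+)", source),
    }
```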
Scope the agent's write permissions. Use allowlists on file paths agents can modify. Infrastructure configuration, database migration files, and authentication code should require explicit human initiation. An agent that can't accidentally modify a migration file can't accidentally introduce a data-destructive change.
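A write-permission gate can live in the agent harness or a pre-commit hook. A sketch using `fnmatch` (the path patterns are illustrative; note that `fnmatch`'s `*` also matches `/`, so `src/*` covers the whole subtree):

```python
import fnmatch

AGENT_WRITABLE = ["src/*", "tests/*"]
ALWAYS_BLOCKED = ["db/migrations/*", "src/auth/*", "infra/*"]

def agent_may_write(path: str) -> bool:
    # The deny list wins: migrations, auth code, and infrastructure
    # config always require explicit human initiation.
    if any(fnmatch.fnmatch(path, pat) for pat in ALWAYS_BLOCKED):
        return False
    return any(fnmatch.fnmatch(path, pat) for pat in AGENT_WRITABLE)
```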
The Code Health Precondition
One finding from practitioners is consistent: AI agents perform worse on low-quality code, not just on large codebases. Functions that are too long, classes with too many responsibilities, and modules with high cyclomatic complexity are harder for agents to reason about correctly — just as they're harder for humans to reason about.
This creates a useful precondition check. Before allowing an agent to modify a module, measure its code health score. If it's below a threshold (one team uses 8.5 out of 10), require human-driven cleanup first. The cleanup improves agent accuracy on the subsequent automated work, and it reduces the surface area where the agent's "plausible-but-wrong" errors can hide.
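The score itself can come from a tool like CodeScene; as a self-contained illustration, here is a toy gate keyed to the branchiness of a module's worst function. The formula and names are ours, not CodeScene's actual code health metric; only the 8.5 threshold comes from the text above.

```python
import ast

def branchiness(func) -> int:
    """Count branching constructs, a crude cyclomatic-complexity proxy."""
    return sum(
        isinstance(n, (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp))
        for n in ast.walk(func)
    )

def module_health(source: str) -> float:
    """Toy score: 10 minus the worst function's branch count, floored at 0."""
    funcs = [n for n in ast.walk(ast.parse(source))
             if isinstance(n, ast.FunctionDef)]
    worst = max((branchiness(f) for f in funcs), default=0)
    return max(0.0, 10.0 - worst)

def agent_may_modify(source: str, threshold: float = 8.5) -> bool:
    return module_health(source) >= threshold
```

When the gate fails, the module goes to a human for cleanup first; the agent gets it back afterward.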
CodeScene's Adam Tornhill summarizes this as: "Speed amplifies both good and bad design decisions." An agent working on well-structured code with good test coverage and explicit documentation produces good changes quickly. An agent working on tangled, implicit, underdocumented code produces tangled, implicit, underdocumented changes — faster.
What the Benchmark Gap Is Telling You
The 44-point gap between SWE-bench Verified performance (65%) and SWE-EVO performance (21%) for the best current models reflects something the industry needs to be honest about: we don't have agents that understand legacy systems. We have agents that can correctly handle isolated, well-specified changes to code that is mostly self-documenting.
That's not useless. Isolated, well-specified changes happen constantly in mature codebases. An agent that can reliably implement a new endpoint that follows existing patterns, or add a new column to a model and update all the dependent read paths, or write tests for a well-understood utility function — that agent saves real time.
The mistake is scope. Agents are best on the well-bounded tasks where a senior engineer would say "this is basically mechanical." They're dangerous on the tasks where a senior engineer would say "I need to think carefully about the downstream effects." The dangerous tasks are often the ones where business pressure makes AI assistance most tempting.
The Forward View
The honest state of the art in 2026 is that AI coding agents are useful on legacy systems only when enough scaffolding surrounds them that the scaffolding itself does a significant fraction of the risk-management work. That scaffolding — scope gates, read-only analysis phases, characterization tests, shadow deployments, explicit tribal-knowledge documentation — is real engineering work. Teams that skip it because the agent seems confident are the ones filing the post-mortems.
The path forward is not agents that are smarter about legacy code, though that will eventually come. It's making legacy codebases more agent-legible: explicit invariants, documented dependencies, behavioral test coverage that actually reflects what the system is supposed to do. This is work that benefits human engineers too, which suggests it's worth doing regardless of whether agents ever fully close the gap.
The teams most likely to benefit from AI coding assistance in mature systems are not the ones who adopt agents earliest. They're the ones who invest in making their codebase legible — to agents and to the next human engineer who joins the team.
Sources
- https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
- https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- https://electric-sql.com/blog/2026/02/02/configurancy
- https://codescene.com/blog/agentic-ai-coding-best-practice-patterns-for-speed-with-quality
- https://kiro.dev/blog/refactoring-made-right/
- https://altersquare.io/ai-generated-code-next-refactor-will-prove-its-not-clean/
- https://arxiv.org/html/2512.18470v2
- https://www.codegeeks.solutions/blog/legacy-code-modernization-using-ai
