AI Coding Agents on Legacy Codebases: Why They Fail Where You Need Them Most
The teams that most urgently need AI coding help are usually not the ones building greenfield services. They're the ones maintaining 500,000-line Rails monoliths from 2012, COBOL payment systems that have processed billions of transactions, or microservice meshes where the original architects left three acquisitions ago. These are the codebases where a single misplaced refactor can introduce a silent data corruption bug that surfaces three weeks later in production.
And this is exactly where current AI coding agents fail most spectacularly.
The frustrating part is that the failure mode is invisible until it isn't. The agent produces code that compiles, passes existing tests, and looks reasonable in review. The problem surfaces in staging, in the nightly batch job, or in the edge case that only one customer hits on a specific day of the month.
The Greenfield Benchmark Trap
Modern AI coding agents are benchmarked on tasks like HumanEval (implement a function from a docstring) and SWE-bench (fix an isolated GitHub issue). These benchmarks have become nearly saturated — frontier models score above 80% on SWE-bench Verified, and HumanEval is effectively solved.
But a recently published benchmark called SWE-EVO measures something different: long-horizon software evolution on real legacy systems. The best current models resolve only 21% of its tasks, against 65% for the same models on SWE-bench Verified. That 44-point drop captures something real about how agents degrade when they leave the controlled, self-contained world of isolated bug reports.
The reason is context. Greenfield code is self-documenting in a mechanical sense: conventions are explicit, the dependency graph fits in a context window, and there's no tribal knowledge to miss. Legacy code is the opposite. The actual constraints on what a function can safely do are scattered across incident postmortems, Slack threads, comments from 2017, and the memory of a developer who left in 2020.
AI agents are trained on code that works. Not code that survives.
What "Plausible-but-Wrong" Looks Like in Practice
A CodeRabbit analysis of 470 open-source pull requests found that AI-authored code produces 1.7× more issues per PR than human-authored code. The breakdown is revealing: logic errors are 75% more common, security vulnerabilities appear at 2.74× the rate, I/O performance problems are 8× more frequent, and concurrency/dependency errors roughly double. These aren't the categories where a quick glance during review catches the problem. They're the categories where the bug lives quietly in production for weeks.
The failure modes cluster into a few recognizable patterns:
Undocumented invariants. Every mature codebase has functions that must only be called under specific conditions — after a particular initialization sequence, never from multiple goroutines, only with non-nil arguments in certain configurations. These invariants aren't in the function signature. They're in the head of whoever wrote it. An agent moving that function to a new module sees clean code and clean type signatures. It doesn't see the invariant.
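A minimal Go sketch of the shape of this failure (the Ledger, Load, and Post names are hypothetical): the ordering constraint exists only in a comment, so a refactor that reorders the call sites compiles and type-checks cleanly while quietly losing data.

```go
// Hypothetical sketch: the invariant ("Load before Post") lives only in
// this comment, not in any signature an agent can see.
package ledger

import "sync"

type Ledger struct {
	mu      sync.Mutex
	entries []int64
}

// Load must run exactly once, before any Post. The original call site
// happened to guarantee that; nothing in the type system does.
func (l *Ledger) Load(snapshot []int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append([]int64(nil), snapshot...) // replaces everything posted so far
}

// Post looks freely relocatable: it takes the lock, it type-checks.
// But any Post that a refactor moves ahead of Load is silently erased
// when Load overwrites entries.
func (l *Ledger) Post(amount int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, amount)
}
```

Nothing fails loudly; the books are simply short one entry, and nobody notices until reconciliation.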
Implicit dependency ordering. Service A writes to a queue. Service B reads from it. Service A happens to flush before Service B checks. Nobody documented this temporal dependency because it was always true and never broke. An agent refactoring the flush logic changes the timing. Now Service B sees empty queues intermittently, under load, in production.
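In Go terms, the accidental ordering might look like this hypothetical sketch (queue, buffer, and flush are invented stand-ins for the two services):

```go
// Hypothetical sketch: the producer's flush and the consumer's poll are
// ordered only by accident of startup sequence.
package main

import (
	"fmt"
	"time"
)

var (
	queue  = make(chan string, 100) // stands in for the shared queue
	buffer []string                 // producer-side batch, flushed into the queue
)

// flush drains the producer's buffer into the queue.
func flush() {
	for _, msg := range buffer {
		queue <- msg
	}
	buffer = nil
}

func main() {
	buffer = append(buffer, "order-1", "order-2")

	// The original code flushed here, before the consumer ever ran, so the
	// consumer always saw a populated queue. The ordering was never
	// documented because it was never false.
	flush()

	// An agent "tidying" the flush logic onto a batching timer, or past the
	// consumer's startup, breaks that accidental ordering. The consumer's
	// non-blocking read then hits the default branch intermittently.
	go func() {
		select {
		case msg := <-queue:
			fmt.Println("processed:", msg)
		default:
			fmt.Println("queue empty, skipping cycle") // the production symptom
		}
	}()

	time.Sleep(50 * time.Millisecond) // let the consumer run before exit
}
```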
Test gaps that don't look like test gaps. Legacy code often has 60–80% line coverage that means almost nothing, because the tests were written to cover lines rather than behavior. An agent that passes the test suite has validated very little about whether it preserved the system's actual behavior. Meta discovered that only about 5% of their 4,100+ pipeline modules had accessible context for agents before they built a pre-compute system to document implicit knowledge.
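A hypothetical sketch of coverage that measures nothing: every line of applyDiscount executes (the function and its threshold are invented for illustration), yet an agent that nudges the loyalty boundary still passes.

```go
// Hypothetical sketch: 100% line coverage of applyDiscount,
// near-zero behavioral coverage.
package pricing

import "testing"

func applyDiscount(total int, loyaltyYears int) int {
	if loyaltyYears >= 5 {
		return total - total/10 // long-time customers get 10% off
	}
	return total
}

// Every line executes, so coverage reports 100%. But each assertion pins
// down only one input: an agent that changes ">= 5" to "> 5" still passes,
// and exactly-five-year customers quietly lose their discount.
func TestApplyDiscount(t *testing.T) {
	if got := applyDiscount(100, 10); got != 90 {
		t.Errorf("got %d, want 90", got)
	}
	if got := applyDiscount(100, 0); got != 100 {
		t.Errorf("got %d, want 100", got)
	}
}
```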
Over-abstraction under the illusion of DRY. When two functions look similar, agents merge them. Often the similarity is accidental: the underlying domain concepts are different, and the two implementations will diverge as requirements evolve. The merged version fills up with conditionals. Meanwhile, duplicate code rates in AI-heavy projects jump from 3.1% to 14.2%, and average file sizes nearly double. The result looks "cleaner" and is harder to change.
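A hypothetical before-and-after of the merge (invoiceTotal, refundTotal, and the flag-laden total are illustrative): the two loops are identical today, but they model different domain concepts, and the first divergent requirement turns the shared body into branching.

```go
// Hypothetical sketch: two lookalike functions an agent is tempted to merge.
package billing

// Before: duplicated loops, but each function owns its own domain concept.
func invoiceTotal(lines []int) int {
	sum := 0
	for _, v := range lines {
		sum += v
	}
	return sum
}

func refundTotal(lines []int) int {
	sum := 0
	for _, v := range lines {
		sum += v
	}
	return sum
}

// After the "DRY" merge: one function, one flag. The first divergent
// requirement (refunds cap at the original charge, invoices add tax)
// turns the shared body into a growing pile of conditionals.
func total(lines []int, isRefund bool, originalCharge int, taxRate float64) int {
	sum := 0
	for _, v := range lines {
		sum += v
	}
	if isRefund {
		if sum > originalCharge {
			sum = originalCharge
		}
		return sum
	}
	return sum + int(float64(sum)*taxRate)
}
```

Keeping the duplication would have let each function absorb its own requirement in a single line; the merged version forces every future change through the flag.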
Why the Standard Safeguards Don't Catch These Bugs
The obvious response is: code review. But legacy codebases are exactly where code review breaks down for AI-generated changes. Reviewing an AI change demands exactly the context the agent lacked, and in a legacy codebase the reviewer often lacks it too: the tribal knowledge missing from the agent's inputs is equally missing from the reviewer's mental model of what to check.
Incidents per PR increased 23.5% and change failure rate rose 30% in the twelve months after AI coding assistants became mainstream. Static analysis warnings in agent-heavy repositories grew 18%. Cognitive complexity scores rose 39%. These are not the metrics of a technology that's working as advertised.
The standard linting and type-checking gates don't help because the errors aren't syntactic or type-level. They're semantic: the code does something different from what the system needs, in ways that only emerge when the system runs. Every sketch above would pass a compiler, a linter, and a cursory review without complaint.
Sources

- https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
- https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- https://electric-sql.com/blog/2026/02/02/configurancy
- https://codescene.com/blog/agentic-ai-coding-best-practice-patterns-for-speed-with-quality
- https://kiro.dev/blog/refactoring-made-right/
- https://altersquare.io/ai-generated-code-next-refactor-will-prove-its-not-clean/
- https://arxiv.org/html/2512.18470v2
- https://www.codegeeks.solutions/blog/legacy-code-modernization-using-ai
