AI Coding Agents on Legacy Codebases: What Works and What Backfires
Most AI coding demos show an agent building a greenfield Todo app or implementing a clean API from scratch. Your codebase, however, is a fifteen-year-old monolith with undocumented implicit contracts, deprecated dependencies that three teams rely on in ways nobody fully understands, and a service layer that started as a single class and now spans forty files. The gap between demo and reality is not just a size problem — it's a structural one, and understanding it before you hand your agents the keys prevents a specific category of subtle, expensive failures.
AI coding agents genuinely help with legacy systems, but only within certain task boundaries. Outside those boundaries, they don't just fail noisily — they produce plausible-looking, syntactically valid, semantically wrong changes that slip through code review and surface in production.
The Context Window Is the Wrong Mental Model
The first mistake teams make when deploying AI agents on large codebases is treating the context window as the only constraint. The reasoning goes: "We have a 1M-token context window; our codebase is 2M tokens of source; we'll just feed in the relevant parts." That arithmetic misses something important — models don't use context uniformly.
Research on context utilization shows that LLM performance degrades significantly as context length grows, with a pronounced "lost in the middle" effect: the model reliably attends to information at the beginning and end of the context window but loses track of details buried in the middle. For a legacy codebase, that means feeding in the "relevant" files often buries the most critical context — the implicit constraint in a utility function three layers deep — exactly where the model stops paying attention.
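One practical mitigation, assuming your retriever already returns chunks ranked by relevance, is to reorder them so the top-ranked material lands at the edges of the prompt rather than the middle (some RAG frameworks ship this as a "long-context reorder" step). A minimal sketch:

```python
def order_for_context(chunks_ranked):
    """Interleave relevance-ranked chunks so the strongest evidence sits
    at the start and end of the prompt and the weakest in the middle,
    where "lost in the middle" attention decay does the least damage."""
    head, tail = [], []
    for i, chunk in enumerate(chunks_ranked):
        (head if i % 2 == 0 else tail).append(chunk)
    return head + tail[::-1]
```

Given ranks `[1, 2, 3, 4, 5]` this yields `[1, 3, 5, 4, 2]`: the best chunk opens the prompt, the second-best closes it, and the weakest is buried in the middle.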
The working constraint isn't "can we fit the codebase in context?" It's "can we fetch the right context, in the right order, for this specific task?" Those are fundamentally different problems. The first is a hardware problem you can throw tokens at. The second is a retrieval quality problem that requires deliberate engineering.
Retrieval-augmented generation over codebases partially solves this. Without RAG, agents freely hallucinate internal API calls that have never existed. With RAG, they're constrained to patterns that appear in retrieved code. CodeRAG-Bench research shows that when high-quality context is provided, generation quality improves substantially — but the retrieval layer often fails to fetch genuinely useful context, especially for cross-file relationships. A model that retrieves the wrong snippets with high confidence produces hallucinated APIs that look correct because they're plausible given the code style.
The practical implication: treat RAG over your codebase as a necessary mitigation for hallucinations, not a solved problem. Audit your retrieval quality explicitly by testing whether queries about your actual internal APIs return the files that define those APIs, not just files that call them.
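That audit can be automated. The sketch below assumes a `retrieve(query, k)` callable wrapping your RAG stack and an `api_index` mapping internal API names to the files that define them; both names are hypothetical stand-ins for your own infrastructure:

```python
def audit_retrieval(retrieve, api_index, top_k=5):
    """Measure definition-recall: for each internal API, does a natural
    query about it retrieve the file that defines it, not merely files
    that call it? `retrieve` and `api_index` wrap your own stack."""
    failures = []
    for api_name, defining_file in api_index.items():
        hits = retrieve(f"How does {api_name} work?", k=top_k)
        if defining_file not in hits:
            failures.append((api_name, hits))
    recall = 1 - len(failures) / len(api_index)
    return recall, failures
```

Run it over a few dozen real internal APIs; a low recall number is a concrete signal to fix chunking or indexing before trusting agent output on those modules.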
The Task Scope Problem
The set of tasks where AI coding agents succeed consistently on legacy code is narrower than the marketing materials suggest.
High-confidence task types:
- Writing unit tests for an isolated module. The agent sees the function, generates test cases for the happy path and edge cases, and produces characterization tests that capture existing behavior. This is the most reliably valuable use case on legacy code. It doesn't require understanding the system's global invariants — just the local contract.
- Explaining what a function does. Documentation generation from legacy code is one of the better applications because it's read-only. The agent can be wrong in a documentation comment without breaking production behavior.
- Extracting a utility class from a single file. Small-scope refactors where the extraction target is fully contained within the context window, with no external callers to track, work well. The agent can see the full blast radius of its change.
- Translating between equivalent patterns. Converting a class-based component to a hook-based one in React, or migrating a callback-style function to async/await — tasks with mechanical structure and a clear mapping — succeed because the semantic intent is preserved by the transformation itself.
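The callback-to-async/await case is worth seeing concretely, since the mapping is what makes it mechanical. In this Python sketch, `fetch_user_cb` is a stand-in for a legacy callback-style API: the callback resolves a Future, so every nested continuation becomes straight-line code after an `await`:

```python
import asyncio

def fetch_user_cb(user_id, on_done):
    # Legacy callback-style API (stand-in for real I/O).
    on_done({"id": user_id, "name": "Ada"})

async def fetch_user(user_id):
    # Mechanical translation: the callback resolves a Future, and the
    # old continuation body becomes the code after the `await`.
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    fetch_user_cb(user_id, fut.set_result)
    return await fut
```

The semantic contract (one result delivered once) is preserved by construction, which is exactly why this task class is agent-safe.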
High-risk task types:
- Cross-cutting changes. Any modification that touches a shared library, a common utility, or an interface used by multiple services creates a context boundary problem. The agent can only reason about the callers it can see. In a large monolith, it typically cannot see all callers. The result is changes that are correct for the subset it analyzed and subtly broken for the rest.
- Implicit contract modifications. Legacy code is full of undocumented assumptions: "this method is only called from the batch job, so null safety doesn't apply at runtime"; "this field being 0 has a special meaning separate from null." These constraints exist only in engineers' heads and occasionally in incident postmortems. The agent doesn't know they exist. A refactor that looks correct destroys these invariants invisibly.
- Dependency graph updates. Upgrading a shared library version, changing a method signature in a widely-used utility, or modifying a database schema column — changes where the full consequence graph exceeds what any context window can hold. The agent fixes the immediate caller and silently breaks three downstream services it never saw.
Why AI-Generated Code on Legacy Systems Fails Differently
AI-generated code quality metrics are sobering even in greenfield contexts. Reports indicate AI-generated code contains roughly 1.7x more major issues than human-generated code, with a higher rate of security vulnerabilities and configuration errors. On legacy systems, the failure modes compound.
The core issue is training data recency. LLMs are trained on code indexed at a specific point in time. GitHub Copilot's training cutoff means it has no awareness of API deprecations, CVE disclosures, or library changes after that date. In a legacy system pinned to an older version of a library, the agent may generate code that's correct for the version it was trained on but wrong for your pinned version, or "helpfully" use newer syntax that the pinned version doesn't support.
The second compound failure is pattern extrapolation. Legacy codebases contain patterns that were idiomatic a decade ago and are now antipatterns. When the agent sees those patterns extensively in the retrieved context, it learns to produce more of them. Rather than modernizing, it faithfully extends the legacy pattern because that's what the local evidence suggests. Teams report agents that write new JDBC boilerplate in codebases that have been partially migrated to an ORM — because the retrieval context pulled files from the old layer.
Tribal knowledge is the third gap. Meta built a system of 50+ specialized agents specifically to extract and document non-obvious patterns — design choices not apparent from the code itself. Before deploying these agents, engineers estimated that 40% of their production debugging time stemmed from acting on incomplete knowledge of these patterns. Documenting tribal knowledge first, then deploying agents, reduced that significantly. Most teams skip this step.
The Safety Pattern: Characterization Tests First
The most reliable gating mechanism for AI-assisted legacy work is a test-first discipline applied to the code being modified, not the code being generated. Before any agent-assisted refactoring of a legacy module:
- Use the agent to generate characterization tests that capture the current observable behavior of the module — including edge cases and error states.
- Have a human engineer review these tests not for correctness of what the tests assert, but for completeness of what behaviors they capture. This is where tribal knowledge enters: the engineer adds test cases for the implicit constraints the agent couldn't know about.
- Run these tests against the existing code. They should pass. This baseline is the contract.
- Now apply the refactoring. The tests enforce that the external behavior has been preserved.
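The first three steps can be sketched concretely. `normalize_price` below is a hypothetical legacy function; the point is that the tests assert what the code does today, including the implicit "None means free" contract that only the reviewing human would think to add:

```python
# Hypothetical legacy function whose behavior we pin down before refactoring.
def normalize_price(raw):
    if raw is None:
        return 0  # implicit contract: None means "free", not an error
    return round(float(raw), 2)

# Characterization tests assert what the code *does*, not what it "should" do.
def test_none_means_free():
    # Tribal knowledge: added by the reviewing engineer, not the agent.
    assert normalize_price(None) == 0

def test_numeric_strings_accepted():
    # Current behavior accepts numeric strings; a refactor must preserve it.
    assert normalize_price("19.999") == 20.0
```

Once these pass against the unmodified code, they become the contract any agent-generated refactor is measured against.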
This pattern works because it separates two concerns: understanding the system (where the agent is unreliable) from transforming it within understood boundaries (where the agent is useful). The characterization tests serve as a formalization of what the human already knows.
Teams that skip this step consistently report higher defect rates. It takes time. It's worth it.
Context Engineering as the Real Constraint
The scalability limit for AI coding agents on large legacy systems isn't the context window size — it's context quality. Two architectural choices make the difference:
Scoped, task-specific context. Rather than loading everything related to the change, load only what's necessary for the specific task. A refactoring task within one module needs that module's code, its direct callers, and the types it uses — not the entire service. Folder-specific or task-specific context files (similar to CLAUDE.md scoped by directory) outperform a single monolithic context file because they encode task-relevant knowledge rather than system-wide background noise.
Pre-task documentation generation. Having an agent read and describe the module before modifying it is surprisingly effective. The generated description, reviewed by a human, surfaces implicit assumptions, identifies dependencies the agent might otherwise miss, and forces a two-pass process — one where the agent builds a model of the code, and one where it modifies within that model. Teams that add this step report fewer cross-cutting breakages.
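The two-pass flow can be made explicit in orchestration code. In the sketch below, `llm` is a stand-in for whatever completion API you use; the value is the forced structure, with the human review happening between the two calls:

```python
def two_pass_edit(llm, module_source, task):
    """Pass 1: the agent builds a model of the code. A human reviews and
    corrects that description. Pass 2: the agent modifies the module
    while being held to the reviewed description's invariants."""
    description = llm(
        "Describe this module: its responsibilities, the callers it "
        "assumes, and any invariants implied by the code.\n\n" + module_source
    )
    # In practice a human edits `description` here before pass 2 runs.
    return llm(
        f"Task: {task}\n\nReviewed module description:\n{description}\n\n"
        f"Source:\n{module_source}\n\n"
        "Apply the task without violating the described invariants."
    )
```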
The organizational analogy is onboarding. You wouldn't ask a new engineer to make a significant change to a legacy system on their first day without first having them read the relevant documentation, talk to the team, and understand the invariants. The same principle applies to agents, except the "talk to the team" step is replaced by curated context documents that encode what the team knows.
The Task-Type Decision Framework
A practical heuristic for deciding whether to deploy an agent on a specific legacy task:
- Blast radius visible in context? If you can load the complete set of files that could be affected by the change and fit them in the context window, the agent can reason about the full impact. Proceed.
- Implicit contracts documented? If the module has characterization tests, or tribal knowledge has been captured in comments or task-specific context, the agent has the constraints it needs. Proceed with review.
- Change spans service boundaries? If the change touches interfaces consumed by other services, or modifies shared library behavior, require human architectural review before any agent-generated change merges. This is not a case where an agent review is sufficient.
- Dependency change involved? Any library version change, database schema modification, or API contract update requires human sign-off. The agent can propose the change; the engineer validates the consequence graph.
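The four questions above collapse naturally into a merge gate. The function below is one illustrative encoding, not a standard; the return strings are placeholders for whatever your review workflow actually enforces:

```python
def agent_task_gate(blast_radius_in_context, contracts_documented,
                    crosses_service_boundary, touches_dependencies):
    """Map the four checklist questions to the minimum review level
    required before an agent-generated change may merge."""
    if touches_dependencies or crosses_service_boundary:
        return "human architectural review required"
    if blast_radius_in_context and contracts_documented:
        return "proceed with standard review"
    if blast_radius_in_context:
        return "proceed; human adds missing contract tests first"
    return "do not delegate to agent"
```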
The failure mode this framework prevents is the most common one in practice: agents that produce changes that pass static analysis, pass type checking, pass the tests that exist, and still break production behavior because the tests didn't cover an invariant the agent didn't know about.
What Actually Improves Over Time
The teams making the most progress with AI agents on legacy systems share one characteristic: they treat context engineering as an ongoing investment, not a one-time setup. Each task cycle produces artifacts — characterization tests, documentation generated from legacy code, tribal knowledge captured in task-specific context files — that make the next agent run more reliable.
This compound effect is the actual value proposition. The first agent run on a poorly-documented legacy module is risky and requires heavy human oversight. The tenth run, after characterization tests have been written, documentation has been generated and validated, and implicit contracts have been codified, is substantially safer.
The mistake is expecting immediate productivity gains from agents on legacy code. The productivity comes later, after the investment in making the codebase agent-readable. That investment has value independent of AI — it's just documentation and tests — but AI agents make the investment necessary sooner, because the cost of skipping it is higher.
Legacy codebases were not designed for AI agents to read. The teams that succeed treat agent-readability as an explicit engineering property to develop, not an assumption to make.
- https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- https://code-rag-bench.github.io/
- https://understandlegacycode.com/blog/can-ai-refactor-legacy-code/
- https://github.blog/ai-and-ml/github-copilot/modernizing-legacy-code-with-github-copilot-tips-and-examples/
- https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html
- https://factory.ai/news/context-window-problem
- https://www.augmentcode.com/tools/ai-coding-assistants-for-large-codebases-a-complete-guide
- https://epoch.ai/benchmarks/swe-bench-verified
- https://arxiv.org/html/2512.05239v1
