AI-Assisted Codebase Migration at Scale: Automating the Upgrades Nobody Wants to Touch
When Airbnb needed to migrate 3,500 React test files from Enzyme to React Testing Library, they estimated the project at 1.5 years of manual effort. They shipped it in 6 weeks using an LLM-powered pipeline. When Google studied 39 distinct code migrations executed over 12 months by a team of 3 developers—595 code changes, 93,574 edits—they found that 74% of the edits were AI-generated, 87% of those were committed without human modification, and the overall migration timeline was cut by 50%.
These numbers are real. But so is this: during those same migrations, engineers spent approximately 50% of their time validating AI output—fixing context window failures, cleaning up hallucinated imports, and untangling business logic errors the tests didn't catch. The efficiency gains are genuine and the pain points are genuine. The question isn't whether AI belongs in code migrations; it's knowing exactly where it helps and where it creates more cleanup than it saves.
The Two Fundamentally Different Tools You Need
The first mistake most teams make is treating "AI-assisted migration" as a single category. There are two distinct tool families, and they're best at different things.
AST-based codemods (ast-grep, jscodeshift, GritQL, OpenRewrite, Comby) work at the syntax tree level. They are deterministic—the same input always produces the same output—and they scale predictably. ast-grep can search and transform millions of lines across a polyglot codebase in seconds. OpenRewrite has 5,000+ recipes for Java, Python, YAML, Terraform, and Kubernetes migrations. These tools don't hallucinate. They also don't understand what your code is doing—they transform structure, not semantics.
LLM-based agents (Claude Code, Copilot, custom pipelines built on frontier models) understand semantics. They can migrate a callback-style API to a promise-based one while preserving the intent of surrounding business logic. They can read inline comments and coding style and produce code that fits the codebase's conventions. They also sometimes invent imports that don't exist, produce nondeterministic output for the same prompt, and fail silently when the codebase exceeds their context window.
The practical conclusion: use AST tools for detection and structural transformation, use LLMs for semantic transformation where the structure change requires understanding meaning. The hybrid approach—deterministic engine for matching patterns, LLM for rewriting the matched section—combines the reliability of the former with the contextual intelligence of the latter.
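A minimal sketch of that hybrid split, using Python's stdlib `ast` module for the deterministic half and a plain callable standing in for the LLM. The API names (`fetch_cb`, `fetch_async`) are hypothetical, and `rewrite_snippet` is a placeholder for whatever model call your pipeline makes—the point is that the model only ever sees spans the AST matcher already identified:

```python
import ast

DEPRECATED = "fetch_cb"  # hypothetical callback-style API being migrated

def find_call_sites(source: str) -> list[tuple[int, int]]:
    """Deterministic step: an AST walk finds every call to the deprecated
    API. No LLM is involved, so this step never hallucinates a match."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == DEPRECATED):
            sites.append((node.lineno, node.end_lineno))
    return sites

def migrate(source: str, rewrite_snippet) -> str:
    """Semantic step: only the matched line ranges are handed to the model
    (`rewrite_snippet` stands in for an LLM call), keeping the prompt small
    and the blast radius limited to known call sites."""
    lines = source.splitlines()
    # Rewrite bottom-up so earlier line numbers stay valid as lines change.
    for start, end in sorted(find_call_sites(source), reverse=True):
        snippet = "\n".join(lines[start - 1:end])
        lines[start - 1:end] = rewrite_snippet(snippet).splitlines()
    return "\n".join(lines)

# A trivial stand-in "model": swap the callback call for an awaited async call.
demo = "result = fetch_cb(url, on_done)\nprint(result)"
migrated = migrate(demo, lambda s: s.replace("fetch_cb(url, on_done)", "await fetch_async(url)"))
```

Because the matcher is deterministic, a hallucination can only corrupt the matched span, never introduce a change in a file the AST pass didn't flag.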
Where AI Migration Is Genuinely 10x Faster
Not all migrations are equal. Some task classes have a profile that makes AI assistance straightforwardly correct:
Mechanical API replacements with well-defined transformation rules. React's deprecated componentWillMount → componentDidMount transition. React Testing Library's analogues for Enzyme's .find() and .simulate(). Next.js codemods for the Pages Router → App Router migration. These have documented transformation patterns, the new API surface is well-understood, and tests immediately tell you whether the transformation was correct. AI succeeds on these because the mapping is clear and the test signal is tight.
Test framework migrations. Airbnb's 3,500-file Enzyme migration is the canonical example. Test files rarely contain business logic that the LLM needs to reason about carefully—they contain test setup, assertions, and mocks that follow predictable patterns. The output quality is high, the failure mode is isolated (a broken test doesn't break production), and the test suite itself is the validation mechanism.
Language version evolution on clean business logic. Python 2 → 3, Java 8 → 17, TypeScript strict mode adoption. When the code is algorithmic—mostly data transformations, utility functions, domain logic without complex infrastructure—correctness on 80–90% of files with minimal manual intervention is realistic. The remaining 10–20% are edge cases where a human needs to review, not structural failures that propagate.
Bulk mechanical refactors. Renaming a deprecated symbol across 4,000 files. Moving imports. Normalizing whitespace and formatting in a legacy codebase before enabling ESLint. These are deterministic enough that even pure codemods handle them well—the LLM's role is smaller and its hallucination risk correspondingly lower.
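A deterministic rename codemod of this kind fits in a few lines with the stdlib `ast` module. This is a sketch, not production tooling—`ast.unparse` normalizes formatting, so real codemods use a concrete syntax tree (libcst, jscodeshift) to preserve whitespace and comments—but it shows why hallucination risk is near zero here:

```python
import ast

class RenameSymbol(ast.NodeTransformer):
    """Minimal deterministic codemod: rename every reference to a deprecated
    symbol. Same input always yields the same output, so it is safe to run
    across thousands of files without reviewing each diff line-by-line."""
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

def rename_in_source(source: str, old: str, new: str) -> str:
    tree = RenameSymbol(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # Python 3.9+; normalizes formatting

# Hypothetical deprecated helper being renamed across a codebase:
out = rename_in_source("x = legacy_fetch(1)\nlegacy_fetch(x)", "legacy_fetch", "fetch")
```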
Where AI Migration Creates More Work Than It Saves
The failure modes cluster around four categories:
Complex architectural refactorings. Moving from a monolithic service to a modular architecture. Splitting a God class into proper domain objects. Reorganizing multi-module dependencies. LLMs underperform sharply on cross-class reasoning. They lack the global codebase context needed to understand which modules depend on each other implicitly, and they tend to produce code that's locally coherent but globally inconsistent. The cleanup work here can easily exceed what a disciplined manual approach would have cost.
Migrations with domain-specific constraints. Financial calculations where rounding behavior has regulatory implications. Insurance underwriting logic where business rules are embedded in the code rather than documented. Medical device software where behavior semantics are specification-driven. AI can translate the surface syntax correctly while subtly changing semantics in ways that unit tests won't catch—and that might not surface until a customer complaint or an audit.
Large-scale architectural rewrites without redesign. Translating a 200,000-line Delphi application to C# line-by-line doesn't produce a C# application—it produces a Delphi application that compiles on the wrong runtime. AI agents can translate code syntax and patterns accurately, but they don't fix structural problems. If you're migrating without rethinking the architecture, you're automating the perpetuation of the original design mistakes, and the cleanup cost comes later.
Codebases with missing type information. Untyped JavaScript, Python without annotations, legacy C without documentation. LLMs hallucinate types and relationships when the code doesn't make them explicit. One hallucinated type cascades: a fabricated interface definition causes the consumer to pass the wrong shape, which causes a serialization error in production, which causes an outage in a system that seemed completely unrelated to the migration.
The Verification Strategy That Makes This Safe
Regardless of which tools you use, one principle is non-negotiable: every migrated file must pass its existing test suite before a human ever reviews the diff.
Google's production pipeline enforces this. Changes are validated through CI/CD before developer review—developers only see changes that have already passed builds and tests. This turns the review workload from "is this correct?" to "is there anything the tests didn't catch?" That's a much smaller cognitive task, and it's the right human-in-the-loop use.
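The gate itself is a small piece of pipeline logic. In this sketch, `run_tests` is a stand-in for your CI invocation (a subprocess running the suite, a CI API call—whatever your infrastructure uses); the invariant is that failing batches loop back to the pipeline and never reach a reviewer:

```python
def gate_for_review(batches, run_tests):
    """Surface only batches whose tests already pass; failures go back to
    the automated pipeline, never to a human reviewer. `run_tests` is a
    placeholder for a real CI invocation."""
    ready, retry = [], []
    for batch in batches:
        (ready if run_tests(batch) else retry).append(batch)
    return ready, retry

# Demo with a stubbed test runner that pretends b_test.py's suite fails:
ready, retry = gate_for_review(
    [["a_test.py"], ["b_test.py"]],
    run_tests=lambda files: "b" not in files[0],
)
```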
The test gate has a specific implication for how you sequence migrations: you cannot safely run large-scale AI migration on a codebase with low test coverage. Before attempting a 3,000-file migration, you need to know which files have meaningful test coverage and which don't. Files with coverage go through the automated pipeline. Files without coverage require test-first manual migration: write the tests, then migrate, then verify. Mixing them without distinguishing creates a false signal: the pipeline reports 3,000 files migrated, the tests pass, and you've silently regressed behavior in the 400 files that had no tests.
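Partitioning by coverage is straightforward once you have per-file numbers (coverage.py and Istanbul can both emit them as JSON). A sketch, with an illustrative threshold—60% is a judgment call, not a standard:

```python
def partition_by_coverage(files, coverage, threshold=0.6):
    """Split a migration batch by test coverage before running the pipeline.
    `coverage` maps file -> line-coverage fraction (e.g. parsed from a
    coverage.py JSON report); the threshold is an assumption to tune."""
    automated = [f for f in files if coverage.get(f, 0.0) >= threshold]
    manual = [f for f in files if coverage.get(f, 0.0) < threshold]
    return automated, manual

# Hypothetical files and coverage numbers:
auto, manual = partition_by_coverage(
    ["auth.py", "billing.py", "legacy_report.py"],
    {"auth.py": 0.91, "billing.py": 0.74, "legacy_report.py": 0.12},
)
# auth.py and billing.py go through the automated pipeline;
# legacy_report.py gets tests written first, then a manual migration.
```

Note that `coverage.get(f, 0.0)` treats a file absent from the report as uncovered—the safe default for this decision.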
A validation layer beyond unit tests matters for production systems. Contract testing validates that refactored code adheres to interface expectations when other services depend on the migrated code's API behavior. Cross-family LLM judges—using a different model family to review the AI-generated diff—catch hallucinations that the original model won't flag as wrong.
Managing the Diff Review Bottleneck
The second critical constraint is the review bottleneck. AI agents can produce a 2,000-line diff across 40 files in seconds. Your reviewers cannot review that at the same speed.
Three patterns help:
Stacked PRs with meaningful boundaries. Break the migration into batches with logical boundaries—by module, by file type, by transformation category. A PR titled "migrate auth module: 23 files, 847 lines" is reviewable. A PR titled "migrate everything: 3,241 files" is not. The overhead of creating more PRs is real; it's smaller than the overhead of rubber-stamping a diff nobody actually reviewed.
Risk-based review triage. Not all changes need the same scrutiny. A purely structural transformation that renamed an import across 200 files, that the CI pipeline validated, that a codemod generated deterministically—this needs spot-check review, not line-by-line. An LLM-generated rewrite of a payment processing flow needs full review. Build the distinction into your review process explicitly.
AI-assisted diff summarization. Use a second LLM pass to generate an executive summary of what changed and why before human review begins. This doesn't replace the review—it gives reviewers a map before they navigate the territory. Teams that do this report significantly lower cognitive load on large diffs and better quality feedback because reviewers spend less time understanding structure and more time assessing risk.
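The stacked-PR pattern reduces to a batching function: group changed files by module, then split any oversized module into chunks. A sketch—the 25-file cap is illustrative, tuned to what your reviewers can actually absorb, and the path layout is assumed to be module-per-top-level-directory:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def batch_for_review(changed_files, max_files=25):
    """Group a large migration diff into reviewable PRs: one batch per
    top-level module, split further when a module exceeds `max_files`.
    Produces PR titles in the 'migrate <module>: N files' style."""
    by_module = defaultdict(list)
    for f in sorted(changed_files):
        by_module[PurePosixPath(f).parts[0]].append(f)
    batches = []
    for module, files in sorted(by_module.items()):
        for i in range(0, len(files), max_files):
            chunk = files[i:i + max_files]
            batches.append({"title": f"migrate {module}: {len(chunk)} files",
                            "files": chunk})
    return batches

batches = batch_for_review(["auth/login.py", "auth/token.py", "billing/invoice.py"])
```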
Incremental Migration Without Breaking the Build
"Migrate everything in one shot" is the most common way to turn a 6-week project into a 6-month one. The industry-standard alternative is the expand-migrate-contract pattern:
- Expand: Add the new interface alongside the old one. Both exist, nothing breaks.
- Migrate: Move consumers to the new interface. Decommission the old one from each consumer after verifying the new one works.
- Contract: Remove the old interface when it has no remaining consumers.
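In code, the expand phase often looks like a thin shim: the old interface stays callable but delegates to the new one and emits a deprecation signal you can count in telemetry. A sketch with hypothetical function names:

```python
import warnings

# Expand: the new interface lands alongside the old one.
def get_user_v2(user_id: int, *, include_profile: bool = False) -> dict:
    user = {"id": user_id}
    if include_profile:
        user["profile"] = {}  # real lookup elided
    return user

# The old interface becomes a shim, so existing callers keep working unchanged.
def get_user(user_id: int) -> dict:
    warnings.warn("get_user is deprecated; call get_user_v2",
                  DeprecationWarning, stacklevel=2)
    return get_user_v2(user_id)

# Migrate: consumers switch call sites to get_user_v2 one at a time.
# Contract: once the DeprecationWarning count in telemetry hits zero,
# delete get_user.
```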
This means running both old and new code simultaneously for a period. The overhead is real. The alternative—a hard cutover—means your migration either succeeds completely or you roll it back completely. That binary creates pressure to push untested changes through faster than is safe.
Feature flags extend this pattern to production rollouts. The new migrated code exists in the build; it's activated by configuration for a specific percentage of traffic or a specific cohort of users. You can monitor the behavioral delta, catch the edge cases that testing missed, and roll back specific users without reverting the whole deployment.
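A common implementation detail worth getting right: the rollout bucket should be a deterministic hash of the user, not a random draw, so a given user never flips between old and new code paths across requests. A minimal sketch (flag name and code paths are stubs):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministic percentage rollout: hash user id + flag name into a
    bucket 0-99. The same user always lands in the same bucket, so cohort
    membership is stable while the percentage ramps up."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

def get_invoice(user_id: str) -> str:
    # Both code paths ship in the same build; configuration decides who runs which.
    if in_rollout(user_id, "migrated-invoice-path", percent=10):
        return "new"   # migrated implementation (stubbed)
    return "old"       # legacy implementation (stubbed)
```

Hashing the flag name together with the user id keeps cohorts independent across flags, so the same 10% of users aren't the guinea pigs for every migration.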
For data schema migrations specifically: separate code migrations from data migrations entirely. Migrate the code first so it runs correctly against both old and new schemas. Then migrate the data. Then remove the backward-compatibility layer. Never do all three simultaneously.
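The "code runs against both schemas" step usually means a schema-tolerant reader in the compatibility layer. A sketch with illustrative field names—here the old schema has a single `name` column and the new one splits it into `first_name`/`last_name`:

```python
def read_customer(row: dict) -> dict:
    """Schema-tolerant reader used during the migrate phase: correct against
    both the old schema (single `name` column) and the new one
    (`first_name`/`last_name`). Removed in the contract phase once the
    data migration is complete."""
    if "first_name" in row:          # new schema
        name = f"{row['first_name']} {row['last_name']}"
    else:                            # old schema, still present mid-migration
        name = row["name"]
    return {"id": row["id"], "name": name}

old_row = {"id": 1, "name": "Ada Lovelace"}
new_row = {"id": 2, "first_name": "Ada", "last_name": "Lovelace"}
```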
The Task Classification Decision
Before starting a migration, classify it:
- High confidence (AI-first): Well-documented framework version upgrade with official codemod support, test framework migration, bulk symbol renames, import reorganization on typed code
- Medium confidence (AI with heavy review): API pattern migration with semantic complexity, language version evolution on mixed-quality legacy code, clean business logic in an untyped language
- Low confidence (manual-first or manual-only): Architectural restructuring, domain logic with regulatory constraints, codebases with <50% test coverage, migrations requiring redesign
For high-confidence migrations, the ROI is exceptional. For low-confidence migrations, the ROI can be negative—the validation overhead, hallucination cleanup, and review burden can exceed the cost of doing it manually while producing output of lower quality.
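The three tiers above can be encoded as a triage function your migration tooling runs before any files are touched. The cutoffs here are judgment calls to adapt to your codebase, not industry standards:

```python
def classify_migration(*, has_codemod: bool, test_coverage: float,
                       typed: bool, regulatory_logic: bool,
                       architectural_change: bool) -> str:
    """Rough triage mirroring the confidence tiers above. Hard blockers
    (architecture, regulated domain logic, <50% coverage) force manual-first
    regardless of how good the tooling is; the remaining thresholds are
    illustrative assumptions."""
    if architectural_change or regulatory_logic or test_coverage < 0.5:
        return "manual-first"
    if has_codemod and typed and test_coverage >= 0.8:
        return "ai-first"
    return "ai-with-heavy-review"

tier = classify_migration(has_codemod=True, test_coverage=0.85, typed=True,
                          regulatory_logic=False, architectural_change=False)
```

The ordering matters: blockers are checked before accelerators, so official codemod support never overrides a regulatory constraint.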
The number that captures this split most honestly comes from Google's study: 87% of AI-generated code committed without human modification. That sounds like near-perfect accuracy. It is, for the migrations Google classified as appropriate for AI-first automation. It says nothing about the migrations they chose not to automate.
What Actually Changes at Scale
The teams that get the most value from AI-assisted migration share a few operational patterns.
They invest in tooling before migrating. A well-tuned codemod or prompt that produces 90% correct output on 100 files extrapolates to 90% correct output on 10,000 files. A poorly tuned one that requires cleanup on 30% of files extrapolates to unmanageable cleanup at scale. The tuning investment pays off multiplicatively.
They track the actual cost on both sides. Migrated files per engineer-hour is visible. Cleanup hours per thousand migrated files is usually not tracked—it gets absorbed into sprint capacity as "reviewing PRs." Making the cleanup cost visible is the prerequisite for making the build-vs-manual decision accurately.
They accept that 50% engineer-time-on-validation is still a good deal. Google's engineers spent half their time validating AI output. That sounds bad until you compare it to the counterfactual: those same engineers spending 100% of their time on manual migration. The accelerator isn't eliminating human involvement—it's shifting the human's role from writing code to judging code, which is faster and which can be done asynchronously against already-validated CI results.
The migrations AI can't do yet—the architectural ones, the semantics-heavy ones, the poorly-documented legacy ones—those aren't getting easier over time. But the ones it can do are getting faster, cheaper, and more reliable. The practical strategy is to capture the wins clearly, measure the costs honestly, and reserve human capacity for the category of migration that still requires it.
If you found this useful, the companion posts on agent idempotency and structured output reliability cover related challenges in production AI systems.
References
- https://research.google/blog/accelerating-code-migrations-with-ai/
- https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b
- https://arxiv.org/html/2501.06972v1
- https://arxiv.org/html/2511.00160
- https://martinfowler.com/articles/codemods-api-refactoring.html
- https://linearb.io/blog/how-google-uses-ai-to-speed-up-code-migrations
- https://www.aviator.co/blog/llm-agents-for-code-migration-a-real-world-case-study/
- https://ast-grep.github.io/guide/introduction.html
- https://docs.openrewrite.org/
