The Expertise Cliff: Why AI Coding Agents Fail in Mature Codebases
A 2025 controlled trial gave experienced developers access to AI coding tools and measured whether they got faster. The developers predicted a 24% speedup. After completing the study, they reported feeling roughly 20% faster. Objective measurement showed they were actually 19% slower.
This isn't a story about AI hype. It's a story about tacit knowledge — the undocumented "why" that lives inside every mature codebase and cannot be recovered by reading the code alone. AI agents are remarkably productive in greenfield systems precisely because there is little tacit knowledge to violate. They degrade in mature codebases for exactly the same reason.
What Tacit Knowledge Actually Means in a Codebase
Michael Polanyi's observation — "we can know more than we can tell" — has an obvious application in software engineering. A senior engineer working in a system she has maintained for three years holds knowledge that will never appear in any documentation:
- The invariant that this cache layer assumes single-writer semantics, which is why there is no locking or compare-and-swap retry logic anywhere in the module (adding synchronization would be "correct" in isolation but would introduce phantom contention; a sketch follows this list).
- The reason the auth middleware is structured in three layers instead of two, which was a deliberate decision after a 2022 incident involving token replay across session boundaries.
- The fact that this seemingly redundant secondary index exists because an ORM migration in 2021 corrupted the primary key space, and removing it would silently break three quarterly reports.
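To make the first of these concrete, here is a minimal sketch of what such a cache module might look like. Everything in it is hypothetical, invented for illustration; the point is that the load-bearing fact, single-writer semantics, appears nowhere the code itself can express.

```python
class MetricsCache:
    """Read-mostly cache. Safe only under single-writer semantics."""

    def __init__(self) -> None:
        self._data: dict[str, float] = {}

    def put(self, key: str, value: float) -> None:
        # No lock, no compare-and-swap: deployment guarantees exactly one
        # writer thread, so plain assignment can never race another write.
        self._data[key] = value

    def get(self, key: str) -> float | None:
        # Concurrent readers may see a slightly stale value; callers
        # tolerate that by design.
        return self._data.get(key)

# What a pattern-matching agent tends to add: type-correct, plausible
# in review, and pure overhead given the real, unwritten invariant:
#
#     with self._lock:
#         self._data[key] = value
```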
None of these facts appear in code comments. They live in old pull request threads, in Slack conversations that have since been archived, in a Notion page someone wrote in 2022 and never linked from anywhere, and in the heads of the engineers who were there.
When an AI agent reads this codebase, it sees patterns. It does not see decisions. It cannot distinguish code that is structured a particular way because of a constraint from code that is structured that way by accident. It defaults to the statistically most common resolution, which is often the right thing in greenfield code and often the wrong thing here.
The Failure Modes Are Systematic, Not Random
AI coding agents fail in mature codebases in recognizable ways. These are not one-off bugs. They are predictable consequences of statistical pattern-matching applied to codebases with implicit constraints.
Constraint violations that compile. AI suggestions frequently compile, pass type checks, and still violate an invariant that was never encoded in the type system, because the team never realized it needed encoding. The code ships. The invariant breaks. The failure surfaces in production months later.
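A schematic illustration of this failure mode, with all names hypothetical: the ordering rule below is real in many transactional systems, but nothing in the types enforces it.

```python
class Txn:
    """Hypothetical transaction handle, invented for illustration."""

    def __init__(self) -> None:
        self._committed = False

    def commit(self) -> None:
        self._committed = True

    def retry(self) -> None:
        # Unwritten invariant: never retry after commit. A retried,
        # already-committed transaction double-applies its side effects
        # downstream. The type system knows nothing about this.
        pass

def agent_suggested_handler(txn: Txn, saw_timeout: bool) -> None:
    txn.commit()
    if saw_timeout:   # compiles, passes type checks, reads plausibly...
        txn.retry()   # ...and violates the ordering invariant
```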
Reversion to common-pattern defaults. When facing an unusual architecture — one built around non-obvious constraints — AI defaults to the most common equivalent pattern from its training data. A team running an event-sourced system will find AI suggestions that introduce mutable state. A team with strict idempotency requirements will find AI suggestions that introduce side effects.
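A compressed sketch of the event-sourcing case, assuming a toy aggregate with hypothetical names:

```python
from dataclasses import dataclass, field

@dataclass
class Account:
    """Toy event-sourced aggregate: the event log is the only state."""

    events: list[int] = field(default_factory=list)  # signed amounts

    def deposit(self, amount: int) -> None:
        self.events.append(amount)  # append-only; the log is the truth

    @property
    def balance(self) -> int:
        return sum(self.events)     # state is always derived, never stored

# The statistically common shape the agent reverts to (a stored, mutable
# balance field), which breaks replay, auditing, and projections:
#
#     def deposit(self, amount: int) -> None:
#         self.balance += amount    # mutable state reintroduced
```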
Copy-paste amplification. GitClear's analysis of 211 million changed lines of code found that 2024 was the first year copy-pasted lines exceeded refactored lines in AI-assisted codebases. AI agents replicate existing patterns without understanding whether those patterns are intentional or incidental. Incidental patterns become canonical as AI reproduces them at scale.
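A small, hypothetical example of how an incidental pattern propagates (the `conn` API here is invented, not any real driver). Assume one module still carries a workaround for a driver bug fixed years ago:

```python
def fetch_user(conn, user_id: int):
    conn.ping()  # 2019 workaround for a dropped-socket driver bug, long fixed
    return conn.query("SELECT * FROM users WHERE id = %s", (user_id,))

# An agent asked to add an order lookup copies the shape verbatim, ping
# and all, and the incidental workaround quietly becomes house style:
#
#     def fetch_order(conn, order_id: int):
#         conn.ping()
#         return conn.query("SELECT * FROM orders WHERE id = %s", (order_id,))
```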
Counter-intuitive constraint failure. Empirical work on constrained code generation finds that constraints which match common training-data patterns achieve 99%+ compliance. Constraints that run counter to common patterns fail at rates between 10% and 100%. A constraint like "always validate user input at the boundary" is common; AI follows it. A constraint like "never validate at the application layer because our load balancer already strips invalid input" is unusual; AI will violate it because every training example it has seen validates at the application layer.
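In code, the unusual constraint from this paragraph might look like the following sketch (hypothetical handler; `store` is a stand-in for persistence):

```python
def handle_upload(payload: bytes) -> None:
    # Deliberately no validation: the edge load balancer has already
    # rejected or normalized malformed input, and re-validating against a
    # drifting local schema copy rejects traffic the proxy intentionally
    # rewrites.
    store(payload)

def store(payload: bytes) -> None:
    ...  # persistence elided

# The agent's revision, matching virtually every training example it has
# seen, and exactly what this team's constraint forbids:
#
#     def handle_upload(payload: bytes) -> None:
#         if not payload:
#             raise ValueError("invalid payload")
#         store(payload)
```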
The practical consequence: you cannot fix this by adding more comments or by prompting more carefully about individual suggestions. The problem is structural. The knowledge that governs constraint-relevant decisions is not in the repository.
Sources
- https://arxiv.org/html/2510.16395v1
- https://arxiv.org/html/2506.12347v1
- https://arxiv.org/html/2601.15153
- https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- https://addyosmani.com/blog/comprehension-debt/
- https://addyosmani.com/blog/automated-decision-logs/
- https://martinfowler.com/articles/reduce-friction-ai/
- https://martinfowler.com/bliki/ArchitectureDecisionRecord.html
- https://adr.github.io/
- https://docs.bswen.com/blog/2026-04-04-tacit-knowledge-developers/
- https://arxiv.org/abs/2604.07192
- https://leaddev.com/technical-direction/how-ai-generated-code-accelerates-technical-debt
- https://www.sonarsource.com/blog/the-inevitable-rise-of-poor-code-quality-in-ai-accelerated-codebases
