The Idiom Your Coding Agent Wrote Around Instead Of Using
A senior engineer on a payments team I work with told me a story that I think every team running coding agents will eventually live through. Their codebase has a Result<T, E> wrapper — homegrown, sits in a single core/result.ts file, used in roughly two hundred call sites across the service. New code is expected to thread Result through every function that can fail; throwing is reserved for genuinely unexpected states. It's not enforced by a lint rule. It is the dialect.
Six months into shipping with a coding agent, they audited the diffs the agent had merged. About a third of the new functions ignored Result entirely. The agent had reached for try/catch, returned T | null, thrown Error subclasses with descriptive messages — every one of those choices is correct in some imagined codebase. None of them was correct in this one. The code typechecked. The tests passed. Reviewers approved it because nothing in it looked wrong line by line. But the file the agent touched no longer fit the file it lived next to, and the team had quietly grown a second dialect inside their own service.
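To make the two dialects concrete, here is a minimal sketch. The team's actual core/result.ts is not shown in this post, so the shape of Result, the ok and err constructors, ParseError, and both parse functions are assumptions made up for illustration.

```typescript
// Hypothetical stand-in for a homegrown Result wrapper (not the team's real file).
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Errors tagged with a string discriminator rather than a class hierarchy.
type ParseError = { kind: "invalid_amount"; input: string };

// House dialect: the failure is visible in the return type.
function parseAmountCents(input: string): Result<number, ParseError> {
  const n = Number(input);
  if (!Number.isInteger(n) || n < 0) {
    return err<ParseError>({ kind: "invalid_amount", input });
  }
  return ok(n);
}

// Drifted dialect: typechecks and passes tests, but callers get a bare null
// and have to switch mental models to handle it.
function parseAmountCentsDrifted(input: string): number | null {
  const n = Number(input);
  return Number.isInteger(n) && n >= 0 ? n : null;
}
```

Both functions are defensible in isolation. The problem only appears when they sit in the same service and a caller has to remember which one it is holding.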
This is the failure mode I want to talk about: not bugs, not hallucinations, not lint violations — idiomatic drift. The agent ships code that compiles, runs, and passes tests, in a style your codebase does not speak. Over enough merges, the codebase bifurcates into agent-style zones and human-style zones, and the cost shows up in places no dashboard is watching.
Idiomatic conformance is a quality dimension your tools don't measure
There is an entire stack of automated quality gates between a coding agent and your main branch — typecheckers, linters, formatters, test suites, security scanners, build pipelines. Every one of them is checking properties that can be expressed as rules. Style consistency, the kind that emerges from a team using Result<T, E> everywhere, is not a rule. It's a pattern that exists in two hundred examples and no specification.
A linter cannot tell you "we don't throw in this codebase, we return a Result." Prettier cannot enforce "use the validateOrThrow helper instead of inline if (!x) throw clauses." A typechecker is perfectly happy with both forms. The conventions that make a codebase legible to its own team — the early-return guard clauses, the test-builder DSL with its specific argument order, the way errors are tagged with a string discriminator instead of a class hierarchy — live in the gap between what passes CI and what passes review.
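A small sketch shows why the gates stay silent. The post names validateOrThrow but not its signature, so the signature below, the Charge type, and both describe functions are assumptions for illustration only.

```typescript
// Hypothetical signature for the helper; the real one is not shown in this post.
function validateOrThrow(condition: unknown, message: string): asserts condition {
  if (!condition) throw new Error(message);
}

interface Charge {
  amountCents: number;
  currency: string;
}

// House style: guards go through the shared helper.
function describeCharge(charge: Charge | undefined): string {
  validateOrThrow(charge, "charge is required");
  validateOrThrow(charge.amountCents > 0, "amount must be positive");
  return `${charge.amountCents} ${charge.currency}`;
}

// Drifted style: inline guards with identical behavior.
function describeChargeInline(charge: Charge | undefined): string {
  if (!charge) throw new Error("charge is required");
  if (!(charge.amountCents > 0)) throw new Error("amount must be positive");
  return `${charge.amountCents} ${charge.currency}`;
}
```

The two functions have the same types, the same formatting, and the same test results. Only a reader who knows the repo can tell which one belongs.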
Industry surveys in 2026 put AI-generated code at around 41% of new code shipped across large engineering organizations, most of it merged with limited human review. A 2026 study comparing AI-authored and human-authored pull requests found that AI PRs are markedly less likely to reuse existing helpers and more likely to inline duplicate logic. The aggregate effect is exactly the bifurcation the payments team observed: the codebase grows, the style islands grow with it, and the team's shared mental model gets fragmented by code that nobody flagged because nothing flag-able was wrong.
The reason the gates do not catch this is structural. Linters operate on local syntax. The agent's drift is a relationship between the file it edited and the eighteen files it didn't read. There is no static-analysis tool you can buy that knows your team prefers validateOrThrow over inline guards, because the only signal that this is your preference is the empirical density of validateOrThrow in your repo. The signal is in the corpus, not in the configuration.
Why agents drift toward the textbook
A coding agent generates code from two sources: the context it was given (the file it is editing and any files it explicitly pulled in) and the prior it learned from its training data. That prior is, roughly, a weighted average of the public code the model was trained on. Public code on the internet does not use your Result<T, E>. It throws exceptions. It uses null. It has its own conventions, drawn from a thousand teams, and none of them is yours.
When the agent edits a function in your codebase, it sees the immediate context — the file, maybe a few neighboring files surfaced by retrieval. It does not see the broader corpus of your repo unless something put that corpus in its context. So the model interpolates: a little bit of "what does this file look like" and a lot of "what does code that does X usually look like." For trivial edits — a typo, a rename — the local context dominates. For anything novel — a new function, a new error path, a new test — the textbook prior wins.
This is the same dynamic that practitioner writeups in 2026 keep flagging: AI suggestions confidently introduce textbook patterns that ignore architectural conventions. The model has been trained on far more public code than any team's private code, and absent strong retrieval into your own repo, the public prior dominates the team prior. The agent is not making a mistake. It is generating from a distribution that does not include your team's vocabulary, and the output reflects that distribution faithfully.
The brittle workaround is to write everything down. AGENTS.md, CLAUDE.md, Cursor rules: by mid-2026 these files ship in tens of thousands of public repos, and they help. But they help only with the rules you remembered to write. The convention you can't articulate, the one that makes a senior engineer refactor a junior's PR without being able to say why, is the convention the agent will skip. You never told the agent about it, because you didn't know you knew it.
The compounding cost of style islands
If the agent's diff merges and the codebase grows by a thousand lines that don't quite fit, the immediate cost is zero. The feature ships. The tests pass. The on-call rotation is unaffected. This is why the failure mode is so insidious: the bill comes due later, and the team that pays it is not always the team that incurred it.
Three concrete costs accumulate. First, cognitive load on every future read. When a developer opens a file and sees a Result<T, E> return type, they know what to do with the value. When they open a file and see T | null, they reach for a different mental model. A codebase with two error-handling dialects forces every future reader to figure out which dialect they're in before they can reason about correctness. The team's collective time-to-comprehension drifts up, slowly, on every file the agent ever touched.
Second, refactor friction multiplies. When you eventually run a migration, add a field to your Result type, or instrument every fallible call with a new observability hook, the change touches the two hundred call sites that use Result and silently misses the agent-authored code that doesn't. Now your "we updated the error path" change has gaps you cannot find without auditing every function in the repo. The team that wrote the migration thought they were done. The team that gets paged at 3 AM finds out they weren't.
Third, code review fatigue compounds. A reviewer scanning a 400-line agent PR for correctness has a finite budget for "wait, why didn't this use validateOrThrow?" The first few reviews catch the drift. The hundredth review catches less. By six months in, the team has trained itself to approve code that compiles and passes tests, because that is the only signal the volume permits. Practitioners report that the generation-to-review ratio is now somewhere around 18x, meaning an agent can produce in a few minutes what a careful reviewer covers in an hour, and the equilibrium that ratio settles into is "ship it if CI is green."
The compounding effect is the part that surprises teams. Idiomatic drift is not a single bad PR. It is the slow re-shaping of the codebase's center of gravity. Every agent-authored file that survives a review becomes part of the retrieval context for the next agent edit, because retrieval pulls neighboring files, and neighboring files are increasingly the ones the agent itself wrote. The agent's textbook prior gets reinforced inside your own repo, and the team's idiom gets quietly outvoted.
What it actually takes to close the gap
The patterns that work treat idiomatic conformance as a first-class engineering concern, not as something a smarter model will fix. There are four moves I keep seeing on teams that have actually kept their style coherent under agent-volume churn.
An idiom index, loaded as context. Maintain a short file — call it IDIOMS.md, put it next to AGENTS.md, point the agent at it on every turn — that names your conventions and points to representative call sites. "Error handling: use Result<T, E> from core/result.ts. See services/billing/charge.ts:42 for a representative example." This is the bare minimum: take the idioms you can articulate and write them down where the agent reads. It will not catch the conventions you cannot articulate, but it will catch most of the obvious ones, and the cost of writing it is hours, not weeks.
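One way to keep such a file from rotting is to hold the entries as data and render the text from them. This is a sketch under assumptions: IdiomEntry, the idioms array, and renderIdiomIndex are hypothetical names, and the single entry is the example quoted above.

```typescript
// Hypothetical shape for one idiom entry; not a format any agent tool requires.
interface IdiomEntry {
  topic: string;
  rule: string;
  example: string; // path:line of a representative call site
}

const idioms: IdiomEntry[] = [
  {
    topic: "Error handling",
    rule: "use Result<T, E> from core/result.ts",
    example: "services/billing/charge.ts:42",
  },
];

// Renders one line per idiom, ready to write into IDIOMS.md or prepend to a prompt.
function renderIdiomIndex(entries: IdiomEntry[]): string {
  return entries
    .map((e) => `${e.topic}: ${e.rule}. See ${e.example} for a representative example.`)
    .join("\n");
}
```

Keeping the example field as a path and line number makes it cheap to add a check that every referenced call site still exists.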
Retrieval before generation. The teams that ship the most coherent agent diffs run a pre-generation retrieval step: before the agent writes a new function, it queries the codebase for "how do we usually do X" and reads three to five existing examples. This is straightforward to build — semantic search over the repo, plus a few hand-tuned queries — and it shifts the agent's effective context from "this file plus the public prior" to "this file plus your team's idiom." It is the closest thing to a structural fix.
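The retrieval step can be sketched in a few lines. To keep the example self-contained, it ranks snippets by plain identifier overlap, which is a deliberately crude stand-in for the semantic search described above. Snippet, tokenize, retrieveExamples, and buildPrompt are hypothetical names.

```typescript
interface Snippet {
  path: string;
  code: string;
}

// Lowercased identifier-like tokens. A real system would use embeddings instead.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z_][a-z0-9_]*/g) ?? []);
}

// Returns up to k snippets that share the most tokens with the query.
function retrieveExamples(query: string, corpus: Snippet[], k = 3): Snippet[] {
  const queryTokens = tokenize(query);
  return corpus
    .map((snippet) => {
      const tokens = tokenize(snippet.code);
      let score = 0;
      queryTokens.forEach((t) => {
        if (tokens.has(t)) score++;
      });
      return { snippet, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.snippet);
}

// Puts the repo's own examples in front of the task before the agent generates.
function buildPrompt(task: string, examples: Snippet[]): string {
  const shown = examples.map((e) => `// ${e.path}\n${e.code}`).join("\n\n");
  return `Existing examples from this repo:\n\n${shown}\n\nTask: ${task}\nFollow the idioms used in the examples above.`;
}
```

The ranking function is the part to replace. The shape of the step is what matters: query first, read the repo's own examples, and only then generate.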
A post-generation idiom audit. After the agent writes a diff, run a second pass — same model, different prompt — that searches the codebase for the patterns most adjacent to the new code and asks: "does this diff use the same idioms?" If it doesn't, the second pass refactors toward the existing pattern, or flags the diff for human attention with a specific question ("this introduces a new error-handling style, is that intentional?"). The post-generation pass is cheap, and it catches a meaningful fraction of the drift that retrieval misses.
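The audit described above is a second model pass. A cheap deterministic pre-filter can sit in front of it and decide which diffs deserve that pass. The sketch below is that pre-filter only, not the model-based audit: the two regex heuristics, IdiomCheck, houseChecks, and auditDiff are assumptions for illustration and would need tuning for a real repo.

```typescript
interface IdiomCheck {
  name: string;
  pattern: RegExp; // matches an off-dialect construct on a single line
  question: string; // what to ask the reviewer when it matches
}

interface Finding {
  check: string;
  line: string;
  question: string;
}

// Hypothetical heuristics for a codebase that returns Result<T, E>.
const houseChecks: IdiomCheck[] = [
  {
    name: "throw-instead-of-result",
    pattern: /\bthrow\s+new\s+\w*Error\b/,
    question: "this introduces a new error-handling style, is that intentional?",
  },
  {
    name: "nullable-return",
    pattern: /\)\s*:\s*[\w<>\[\], ]+\|\s*null\b/,
    question: "this returns T | null where the codebase returns Result<T, E>, is that intentional?",
  },
];

// Lines added by a unified diff, without the leading "+" and without file headers.
function addedLines(diff: string): string[] {
  return diff
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++"))
    .map((l) => l.slice(1));
}

function auditDiff(diff: string, checks: IdiomCheck[]): Finding[] {
  const findings: Finding[] = [];
  for (const line of addedLines(diff)) {
    for (const check of checks) {
      if (check.pattern.test(line)) {
        findings.push({ check: check.name, line: line.trim(), question: check.question });
      }
    }
  }
  return findings;
}
```

Only added lines are inspected, so existing off-dialect code does not generate noise on every PR that touches the file.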
Periodic style-island audits. Once a quarter, run a tool that identifies functions whose style is statistically anomalous compared to the rest of the codebase — files where the error handling, the import patterns, the test structure deviate from the median. Treat the results as a refactor backlog, not as PR feedback. The point is not to blame the agent for the drift; the point is to make the drift visible enough that the team can pay it down deliberately instead of discovering it during the next migration.
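A rough version of that audit fits in a page. The sketch below scores each file by the share of its error-handling constructs that use the house dialect, then flags files far from the repo median. The regex signals, the 0.5 tolerance, and the names RepoFile, houseShare, and findStyleIslands are all assumptions, and a real tool would parse the code rather than pattern-match it.

```typescript
interface RepoFile {
  path: string;
  source: string;
}

function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) ?? []).length;
}

// Share of error-handling constructs in the house dialect, or null if the file has none.
function houseShare(source: string): number | null {
  const house = countMatches(source, /\bResult</g);
  const other = countMatches(source, /\bthrow\b/g) + countMatches(source, /\|\s*null\b/g);
  const total = house + other;
  return total === 0 ? null : house / total;
}

function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Paths of files whose share deviates from the repo median by more than the tolerance.
function findStyleIslands(files: RepoFile[], tolerance = 0.5): string[] {
  const scored: { path: string; share: number }[] = [];
  for (const file of files) {
    const share = houseShare(file.source);
    if (share !== null) scored.push({ path: file.path, share });
  }
  if (scored.length === 0) return [];
  const repoMedian = median(scored.map((s) => s.share));
  return scored
    .filter((s) => Math.abs(s.share - repoMedian) > tolerance)
    .map((s) => s.path);
}
```

Comparing against the median rather than a fixed rule means the audit follows whatever the repo's dominant dialect actually is, which is the point: the signal is in the corpus.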
None of these is a silver bullet. Each carries a real maintenance cost. But the alternative is the bifurcation I started this post with, and that bifurcation costs far more, in on-call pages, refactor velocity, and new-hire ramp time.
Style is shared cognition, not aesthetic preference
The hardest part of this argument to land with leadership is the framing. Style consistency sounds like a thing senior engineers care about for irrational reasons — "we just like it this way." It is not. Style is the compressed encoding of every decision the team has already debated and resolved. When a codebase uses Result<T, E> everywhere, that is the team's shorthand for "we have decided error handling is too important to live in catch blocks, and every fallible call should be visible in the type signature." A new contributor — human or agent — who skips the idiom is not just deviating from a preference. They are forcing the rest of the team to re-litigate a settled question on every code review.
An agent that does not learn the idiom is, in a precise sense, fragmenting the shared cognitive infrastructure of the team. The codebase stops being a single artifact and becomes a federation of dialects, each with its own assumptions, each maintained by whoever last touched it. The team's ability to reason about the system as a whole erodes — not because any single change is wrong, but because the medium the team uses to think together has been quietly diluted.
The teams that get this right are not the ones with the smartest agents. They are the ones who treat their codebase's idiom as a contract the agent is expected to honor, give the agent the context it needs to honor it, and audit the result. Everyone else is buying velocity now and paying interest on it for years.
