
LLM-as-Compiler Is a Metaphor Your Codebase Can't Survive

10 min read
Tian Pan
Software Engineer

The pitch is seductive: describe the behavior in English, the model emits the code, ship it. Prompts become the source, artifacts become the target, and the LLM sits between them like gcc with a friendlier front-end. If that framing held, the rest of software engineering — review, refactoring, architecture — would be downstream of prompt quality. It does not hold. And the codebases built on the assumption that it does start failing in a pattern that is now boring to diagnose: around month six, nobody can explain why a particular function looks the way it does, and every incremental change produces a wave of duplicates.

The compiler metaphor is the root cause, not vibe coding, not model quality, not prompt skill. It is a category error that quietly excuses teams from doing the work that keeps a codebase coherent over years. When you believe the model is a compiler, the generated code is an implementation detail, the same way assembly is an implementation detail of a C program. When you are actually running a team of non-deterministic, context-limited collaborators, the generated code is the asset — and the prompts are closer to Slack messages than to source.

What Compilers Actually Promise

It is worth being literal about what a compiler is, because the metaphor gets most of its rhetorical force from vagueness. A compiler takes a deterministic input in a formal language with an unambiguous grammar and produces an artifact that is, up to a small number of well-documented flags, byte-identical across runs. Two engineers compiling the same commit produce the same binary. The mapping from source lines to generated instructions is traceable; the mapping from language features to runtime semantics is specified; changes to the compiler are gated by standards bodies, regression suites, and release notes. You can throw away the binary and regenerate it without surprise.
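Determinism of that kind is checkable in one script. Below is a minimal sketch, assuming a TypeScript repo; the paths and the npx tsc invocation are illustrative, not from any particular project:

```ts
// determinism-check.ts — hypothetical sketch: compile the same source twice
// and compare digests. Real compilers pass this check; it is the property
// everything downstream of the metaphor leans on.
import { execSync } from "node:child_process";
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function sha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

execSync("npx tsc src/money.ts --outDir build1");
execSync("npx tsc src/money.ts --outDir build2");

// Same source, same compiler, same flags: same bytes, every time.
console.log(sha256("build1/money.js") === sha256("build2/money.js")); // true
```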

None of this is true of an LLM. Temperature zero does not save you, because any change to the prompt — a new example, a reordered instruction, reformatted whitespace — moves you outside the fixed point. The model itself is not pinned by anything you control: the weights can change on the provider's cadence, not yours, and the same prompt three months later can produce code that is recognizably different in structure. There is no standards body, no specification of "what GPT-5 means when you say 'gracefully handle rate limits,'" and there is no guarantee that regenerating a module from its original prompt recovers anything close to what is currently in your repo.
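You can watch the fixed point dissolve in a few lines. The sketch below uses OpenAI's Node SDK; the model name is a placeholder, and nothing about the point is specific to one provider:

```ts
// A sketch with OpenAI's Node SDK (npm i openai). The model name is a
// placeholder alias, and the provider can swap weights behind it at will.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generate(prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o", // placeholder model alias
    temperature: 0,
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
}

// The prompts differ only in quoting style. A compiler's grammar would call
// these equivalent; a model treats them as two different inputs.
const a = await generate("Write a TypeScript function that parses '$1,234.56' into integer cents.");
const b = await generate('Write a TypeScript function that parses "$1,234.56" into integer cents.');
console.log(a === b); // almost certainly false; diffing them shows structural drift
```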

This is not a pedantic complaint. It is the hinge that decides whether the generated code is authoritative or the prompt is. In the compiler world, the source is authoritative and the binary is disposable. In the LLM world, the generated code is authoritative — it is what actually executes, what your tests exercise, what your on-call engineer reads at 3am — and the prompt is a lossy, one-time artifact of how it got there. Teams that invert this, treating the prompt as source and the code as binary, find themselves unable to answer "why does this function do X" with anything better than "because the model produced it that way."

The Failure Modes Show Up Around Month Six

The damage is not immediate, and that is what makes it insidious. Early velocity looks phenomenal. A feature that would have taken two weeks ships in two days. Stakeholders notice. The ratio of generated code in the repo climbs. Then, around the time the product has enough surface area that every change touches something the model wrote months ago, three patterns surface together.

The first is duplicate-logic sprawl. GitClear's 2025 analysis of 211 million changed lines found that copy-pasted code rose from 8.3% to 12.3% of changed lines between 2021 and 2024, and the frequency of duplicated five-line-plus blocks jumped roughly eightfold. In the same window, the share of changed lines associated with refactoring fell from 25% to under 10%. The mechanism is mechanical: the model does not see your whole codebase, so it cannot propose "reuse parseCurrency from lib/money.ts." It proposes a fresh parseCurrency inline, and tab-accept is easier than pulling it out. Multiply that over a year and you have five implementations of the same concept, each slightly different, all shipped to prod, none discoverable from the others.
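Concretely, the sprawl looks like this. Both implementations below are hypothetical, but the divergence between them (integer cents versus floating-point dollars, a thrown error versus a silent null) is the typical shape of the problem:

```ts
// lib/money.ts — the canonical utility from the example above (body hypothetical)
export function parseCurrency(input: string): number {
  const cleaned = input.replace(/[$,\s]/g, "");
  const value = Number(cleaned);
  if (Number.isNaN(value)) throw new Error(`unparseable amount: ${input}`);
  return Math.round(value * 100); // integer cents; loud failure on garbage
}

// features/invoices/format.ts — a later generation that never saw lib/money.ts
function parseCurrency(raw: string): number | null {
  const match = raw.match(/-?\d+(\.\d{1,2})?/);
  // Floating-point dollars, silent null on failure — and on comma-grouped
  // input like "$1,234.56" it quietly returns 1. Plausible, shipped, wrong.
  return match ? parseFloat(match[0]) : null;
}
```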

The second is onboarding-hostile opacity. Generated code tends to be locally correct and globally anonymous. It solves the task it was prompted on, but it carries none of the "why" that human-written code accumulates through argument, revision, and the ambient pressure of having your name on a blame line. When a new engineer asks "why is this retry loop shaped like this, and not like the one two files over," the answer is often that there is no answer — the two loops were produced from two different prompts by two different engineers in two different weeks, and no one owns the choice. Birgitta Böckeler's framing of this in the harness-engineering literature is sharp: agents have "no social accountability, no aesthetic disgust at a 300-line function, no intuition that 'we don't do it that way here.'" Code review was the historical mechanism that transmitted those intuitions. When review is compressed to "does it pass CI," transmission stops.
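The retry-loop question is worth making literal. Both hypothetical versions below are defensible in isolation; side by side, there is no principled answer to why they differ:

```ts
// services/billing/client.ts — one session's idea of a retry loop
async function fetchWithRetry(url: string): Promise<Response> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const res = await fetch(url);
    if (res.ok) return res;
    await new Promise((r) => setTimeout(r, 2 ** attempt * 100)); // exponential backoff
  }
  throw new Error(`gave up on ${url} after 3 attempts`);
}

// services/notifications/client.ts — another session's idea, weeks later
async function requestWithRetries(url: string, retries = 5): Promise<Response> {
  try {
    const res = await fetch(url);
    if (!res.ok) throw new Error(`status ${res.status}`);
    return res;
  } catch (err) {
    if (retries === 0) throw err;
    await new Promise((r) => setTimeout(r, 1000)); // fixed delay, recursion instead of a loop
    return requestWithRetries(url, retries - 1);
  }
}
```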

The third is drift between intent and artifact. A recent CodeRabbit analysis of 470 open-source pull requests reported that AI co-authored PRs carried roughly 1.7x more "major" issues than human-written ones, with misconfigurations 75% more common and security vulnerabilities 2.74x more common. The 2024 DORA report, looking at a much larger population, found that higher AI adoption correlated with declines in both delivery throughput and system stability — the opposite of what the compiler metaphor predicts. An academic study of 8.1 million pull requests linked AI tool adoption to a 30–41% increase in measured technical debt. None of these findings is about the model getting something dramatically wrong on a single task. They are about a thousand small divergences between what the prompt implied and what the code actually does, accumulating faster than anyone reconciles them.

The Disciplines the Metaphor Silently Cancels

Once you accept the compiler framing, a set of engineering practices quietly falls off the table because they no longer seem to apply. You do not code-review the output of rustc. You do not refactor your object files. You do not write architecture documents for your linker. If the LLM is a compiler, then review, refactoring, and architectural stewardship are ceremonies you have graduated from.

The concrete consequences are predictable. Review discipline degrades into a glance at the diff and a thumbs-up, because the reviewer implicitly treats the model's output as a trusted translation rather than as a junior engineer's first draft. Refactoring cadence collapses, both because the per-sprint volume of generated code overwhelms the team's capacity to consolidate it and because refactoring a module someone barely understands feels like unpaid archaeology. Architectural judgment is outsourced to the model's defaults, which is to say, to whatever shapes the training distribution happened to bias toward — often the most common, not the most appropriate for your system's constraints.

Taste is the quiet casualty. The senior engineer who pushes back on a 300-line function, or insists that the new feature reuse an existing abstraction instead of minting a parallel one, is doing the kind of high-judgment work that models cannot replicate because it is not a translation task. It is curation. It is enforcement of an internal grammar that only your team knows. The compiler metaphor has no place for curation; gcc does not have taste. If your organization adopts the metaphor, it stops hiring for taste, stops rewarding it in review, and stops protecting the time it takes to exercise it. Two years later, the codebase is a monument to whatever the model's defaults happened to be in 2025.

The Honest Framing: A Prolific Junior Engineer

The framing that actually matches the failure modes is the one that has quietly emerged among practitioners who ship AI-assisted code at scale: the LLM is a fast, prolific junior engineer with patchy context and no skin in the game. This is less flattering to the technology and more useful as an operating assumption.

Under this framing, the disciplines that the compiler metaphor canceled come back, and they come back with teeth. Generated code gets reviewed as generated code, not as a compilation artifact — meaning the reviewer is looking for the same things they would look for in a junior's PR: misdiagnosed requirements, premature abstractions, ignored existing utilities, plausible-looking-but-wrong error handling. Refactoring is scheduled deliberately, not hoped for. The DX code rot research recommends teams allocate 10–20% of each sprint to maintenance; under the compiler metaphor that number is zero because there is nothing to maintain. Architectural review gates — "does this new module fit the conventions of the directory it lives in" — become explicit steps in the pipeline, often enforced by a second AI pass configured to care specifically about consistency rather than correctness.
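That second pass does not need to be elaborate. Here is a hypothetical sketch of a consistency-only gate; every path, model name, and convention file in it is an assumption, not a prescription:

```ts
// scripts/consistency-gate.ts — hypothetical CI step. It reviews the diff
// ONLY for convention fit, leaving correctness to the test suite.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();
const diff = execSync("git diff origin/main...HEAD").toString();
const conventions = readFileSync("docs/ARCHITECTURE.md", "utf8"); // assumed location

const review = await client.chat.completions.create({
  model: "gpt-4o", // placeholder; pin whatever your team standardizes on
  temperature: 0,
  messages: [
    {
      role: "system",
      content:
        "Review this diff ONLY for consistency with the conventions below: " +
        "reimplemented utilities, naming, directory fit, parallel abstractions. " +
        "Ignore correctness; CI covers it. Reply PASS or list violations.\n\n" +
        conventions,
    },
    { role: "user", content: diff },
  ],
});

const verdict = review.choices[0].message.content ?? "";
console.log(verdict);
if (!verdict.trim().startsWith("PASS")) process.exit(1); // block the merge on drift
```

The point is not the script but the separation of concerns: one gate cares about whether the code works, another about whether it belongs.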

The prompt, under the junior-engineer framing, is no longer source. It is an instruction given in a meeting, as lossy and context-dependent as any other verbal direction. You do not check it into the repo in the same sense you check in code. The code itself is the artifact that must survive, and the question "six months from now, when someone has to change this, will they be able to?" is the one you are actually optimizing against — not "did the prompt produce working code today."

The framing also clarifies a common mistake: treating the model as a single junior engineer rather than a rotating cast of them. Each generation is a fresh mind with no memory of the last one. If you would not let ten different contractors each write one feature with no shared architect and no code-review overlap, do not let ten different LLM sessions do it either. The team convention that stops this in human engineering — a tech lead who reads every PR, an architecture doc that predates the feature, a refactoring sprint every quarter — is exactly the convention you need to keep.

Keep the Metaphor Small

The compiler metaphor is not useless. It is accurate for the moment of generation itself: you gave an input, you got an output, and within that single turn the abstraction holds well enough. The mistake is extending it to everything downstream. Generated code is not a disposable binary; it is code your team will live with. The prompt is not source; it is a memo. The model is not a compiler; it is a collaborator with a particular pattern of strengths and a particular pattern of things it will confidently get wrong.

The teams that will be maintainable into 2027 are the ones that keep the metaphor small. They let it describe how a line of code was produced, and they refuse to let it describe how that code should be reviewed, refactored, owned, or architected. The six-month wall is not inevitable, but avoiding it requires believing — in writing, in process, in staffing — that the LLM is one engineer on your team, not the compiler that replaces the rest of them.
