
Spec, Code, Tests, One Author: The Independence You Quietly Lost

11 min read
Tian Pan
Software Engineer

When the same model writes the requirements, implements them, and authors the assertions that say it is correct, "all tests pass" is no longer evidence the feature works. It is evidence the model is internally consistent. Those are different things, and the difference is the entire point of having tests in the first place.

The standard story we tell about test suites is that they are a second opinion. The author wrote the code with one mental model of the requirement, the test author wrote the assertions with a slightly different mental model, and the points where the two models disagree are where the bugs live. That story depends on the test author having a different cognitive vantage point than the code author. Strip out the difference in vantage points and the test suite stops carrying any independent information about correctness — it only carries information about consistency.

Most teams shipping AI-assisted software in 2026 have not noticed they made this trade. The model writes a spec doc, writes the code that implements the spec, and writes the unit tests that pin the code to the spec. CI goes green. The PR description is generated from the diff. The reviewer skims the summary and approves. Every artifact was authored by the same brain in the same session, and "all tests pass" reads as proof the feature works because that is what green builds used to mean back when humans wrote the tests.

A 1986 paper called this twenty years before LLMs

The closest prior art to this failure mode predates large language models by four decades. In 1986, Knight and Leveson ran the experiment that quietly killed N-version programming as a serious reliability strategy. They had 27 programmers at two universities independently write programs from the same specification. The hypothesis under test was simple: programs developed independently would fail independently, so a system that ran multiple versions and voted on the output would be vastly more reliable than any single version.

The data did not cooperate. Roughly half of the 45 detected faults were correlated across versions. In 1,255 out of one million test cases, multiple versions failed simultaneously, sometimes as many as eight at once. The hypothesis that the versions failed independently was rejected with greater than 99% confidence. The lesson the field absorbed was that independent human authors writing from the same specification still produce correlated faults, because they share a culture, a curriculum, an interpretation of the spec, and the standard CS-major instinct for which edge case is "obviously" the tricky one.

Now apply that lens to a single model writing at three layers. The Knight-Leveson result said that humans with different educations and different mistakes still correlate at around 50% on the same spec. A model writing spec, code, and tests in one session has a correlation closer to 100%, because all three artifacts emerge from one prior, one training distribution, and one in-context interpretation. Whatever the model misunderstood about the requirement, it misunderstood three times. There is no second opinion in the room.

This is the framing the field already has the vocabulary for. It is just called "common-mode failure" in safety-critical systems, and it is the reason aviation does not actually run identical flight computers and vote — the votes are too correlated to be worth the cost. The thing teams call "AI-assisted development" today is, viewed from the safety-engineering side, an N=1 system pretending to be N=3 because the same brain produces three artifacts that look like they came from different authors.

The dangerous failure is silent absence, not a wrong test

When teams worry about model-authored tests, they usually worry about the wrong failure. The vivid worry is "the model wrote a wrong test that asserts the buggy behavior." That happens, but it is loud. Tests that assert wrong behavior tend to look strange in review and tend to break in production the first time a real input lands.

The dangerous failure is silent. A behavior the user cares about is missing from all three artifacts because the model's prior never surfaced it. The spec does not mention it. The code does not handle it. The test does not assert it. CI is green not because the system handles the case correctly, but because no one — neither the implementer nor the assertion author, who are the same author — thought to ask whether the case existed.

A concrete shape this takes: a billing endpoint where the spec covers refunds, the code implements refund logic, and the tests cover the obvious refund paths. None of the three artifacts mention what happens when the customer's billing currency changed between the original charge and the refund request. The model never thought to spec it because no example in its training data made the case salient. The code therefore handles the refund in the original currency without checking, the test suite covers the cases the spec mentioned, and the green build certifies "refund logic implemented." Three days after launch a customer who switched currencies gets a refund of the wrong amount, and the postmortem reveals that no artifact in the system ever acknowledged the case existed.
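To make the blind spot concrete in code, here is a minimal sketch of the shape this takes. Everything in it is hypothetical (the `Charge` record, `issue_refund`, the stubbed currency lookup); the point is structural: neither the function nor its test ever mentions the customer's current billing currency, so nothing in the pipeline can go red.

```python
from dataclasses import dataclass

@dataclass
class Charge:
    amount_cents: int
    currency: str      # currency at the time of the charge, e.g. "USD"
    customer_id: str

def get_billing_currency(customer_id: str) -> str:
    """Look up the customer's *current* billing currency (stubbed)."""
    ...

def issue_refund(charge: Charge) -> dict:
    # Refunds unconditionally in the original charge currency. The spec never
    # mentioned currency changes, so neither does the code.
    return {
        "customer_id": charge.customer_id,
        "amount_cents": charge.amount_cents,
        "currency": charge.currency,  # never compared against
                                      # get_billing_currency(charge.customer_id)
    }

# The same-author test pins the code to the same-author spec, and shares
# the same blind spot. It is green, and silent about currency changes.
def test_refund_matches_charge():
    charge = Charge(amount_cents=4999, currency="USD", customer_id="c_1")
    refund = issue_refund(charge)
    assert refund["amount_cents"] == 4999
    assert refund["currency"] == "USD"
```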

This is exactly the failure mode that N-version programming fails to catch. If three independent humans all forget to handle the same edge case because the spec was silent about it, the voter has nothing to vote on — three identical wrong answers vote unanimously for the wrong outcome. The unspoken assumption that test suites detect bugs depends on the test author asking questions the implementer did not think to ask. A single author asks the same questions in both seats.
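The unanimous wrong vote is worth seeing once in miniature. Here is a toy voter over three hypothetical versions that all inherited the same missing case:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the N versions."""
    return Counter(answers).most_common(1)[0][0]

# Three "independent" versions that inherited the same blind spot: none of
# them checks whether the billing currency changed, so on that input they
# return the same wrong answer, and the voter ratifies it unanimously.
def v1(charge): return ("refund", charge["currency"])
def v2(charge): return ("refund", charge["currency"])
def v3(charge): return ("refund", charge["currency"])

charge = {"currency": "USD", "current_billing_currency": "EUR"}
print(majority_vote([v(charge) for v in (v1, v2, v3)]))
# ('refund', 'USD') -- a 3-0 vote for the wrong outcome
```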

Independence requires a different author at one of the three layers

There are three places to inject genuine independence into the spec-code-tests pipeline, and at least one of them needs to land if "tests pass" is to mean what it used to mean.

The first option is the cheapest and most underrated: a human writes the spec. Not "a human reviews the spec the model wrote" — a human authors the requirements document, in their own words, before the model sees the code. The spec then becomes an external contract that the model is being measured against, and the model's interpretation of the requirement is no longer the requirement. This is how product organizations shipped software before the model could plausibly write the PRD, and the discipline still works. The cost is that someone has to write a real PRD, which is unfashionable; the benefit is that you reintroduce the only audit layer in the pipeline whose author is structurally different from the implementer.

The second option is to use a different model family at the test layer than at the code layer. If the code is written by Model A and the tests are written by Model B, and A and B were trained on different corpora with different curation pipelines and different post-training, the priors disagree in places that matter. The 2025 research on LLM-as-oracle and on differential testing across implementations consistently shows that cross-family disagreement uncovers bugs that same-family pipelines miss. The discipline is not "use any second model" — it is "use a model whose training data distribution is genuinely different," because two models from the same lab share a lot of their prior even when the weights differ.
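What the discipline looks like in a pipeline is simple to sketch. The functions below are hypothetical stand-ins for two different providers' APIs, not real SDK calls; the load-bearing detail is that Model B sees the spec and only the spec, never Model A's code or tests.

```python
# Hypothetical cross-family pipeline. generate_code and generate_tests are
# stand-ins for calls to two different providers' APIs, not real SDK functions.

def generate_code(spec: str, model: str) -> str:
    """Ask `model` to implement the spec (stubbed provider call)."""
    ...

def generate_tests(spec: str, model: str) -> str:
    """Ask `model` to write tests from the spec alone (stubbed provider call)."""
    ...

SPEC = "..."  # the requirements document, e.g. the contents of specs/refunds.md

# Model A writes the implementation.
code = generate_code(SPEC, model="family-a")

# Model B writes the tests from the spec only. It never sees Model A's code or
# tests, so its misreadings of the requirement are its own, not inherited.
tests = generate_tests(SPEC, model="family-b")

# CI runs Model B's tests against Model A's code. Where the two priors
# disagree, the build goes red instead of settling into silent consensus.
```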

The third option is to add a property-based test layer authored adversarially. Property-based tests assert invariants — "for all inputs, the output must satisfy X" — rather than examples. The epistemic stance behind a property-based assertion is different from the stance behind a unit test: instead of "I imagined this input and predicted this output," it is "I claim this invariant holds across all inputs, prove me wrong." Recent work on property-based testing for LLM-generated code shows substantial improvements when properties are authored separately from the implementation, because invariants force the test author to think about the shape of the problem rather than the surface of a few examples. A property-based layer does not have to be authored by a human — it has to be authored from a different prompt, in a different session, with a different framing than the code generation prompt, so that the same blind spot does not transfer.
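Here is a minimal sketch of that stance using Python's Hypothesis library, reusing the hypothetical refund example from earlier; the invariant itself is an illustrative assumption about billing, not a known-correct rule.

```python
from hypothesis import given, strategies as st

# Charge and issue_refund are the hypothetical billing sketch from earlier;
# set_billing_currency is an equally hypothetical test fixture.
_BILLING = {}

def set_billing_currency(customer_id: str, currency: str) -> None:
    _BILLING[customer_id] = currency

CURRENCIES = ["USD", "EUR", "GBP", "JPY"]

@given(
    amount_cents=st.integers(min_value=1, max_value=10_000_000),
    charge_currency=st.sampled_from(CURRENCIES),
    current_currency=st.sampled_from(CURRENCIES),
)
def test_refund_lands_in_current_billing_currency(
    amount_cents, charge_currency, current_currency
):
    charge = Charge(
        amount_cents=amount_cents, currency=charge_currency, customer_id="c_1"
    )
    set_billing_currency("c_1", current_currency)
    refund = issue_refund(charge)
    # "For all inputs" includes the inputs nobody imagined: as soon as
    # Hypothesis draws charge_currency != current_currency, the sketched
    # issue_refund fails here -- the first artifact in the pipeline to
    # acknowledge that the case exists.
    assert refund["currency"] == current_currency
```

Run against the `issue_refund` sketched earlier, this goes red on the first draw where the two currencies differ, which is exactly the independent signal the example-based suite never carried.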

The "human review" trap

The most common reflex when teams hear "you need an independent audit layer" is to add a human PR reviewer. This is fine when the reviewer actually re-derives the requirements. It is theater when the reviewer reads what the model wrote.

The trap is that AI-assisted PRs come bundled. The diff includes the code, the tests, and a generated PR description that explains what the change does and why. A reviewer who reads the description first, then skims the diff to confirm the description matches, has not added an independent vantage point — they have added a check that three internally consistent artifacts are internally consistent. They will trivially agree with each other. That is what "internally consistent" means.

Genuine review independence requires the reviewer to read the original feature request — the ticket, the bug report, the customer email — and judge whether the diff addresses it, without reading the model's summary first. Better yet, the reviewer writes their own one-sentence understanding of the requirement before opening the diff, and then checks whether the diff matches what they wrote. The state-of-AI-code-quality data from late 2025 was clear on this: when AI review tools shipped, the rate of human comments on PRs collapsed to under 20%, and AI-authored changes produced roughly 1.7x more issues per PR than human-authored ones. The two numbers travel together. Reviewers stopped doing the work that the test suite no longer did either.

Test independence is a distributed-systems property

The cleanest way to think about all of this is to import a frame from distributed systems and stop pretending tests are special. Three identical replicas of a service do not catch each other's bugs. A logic bug in the request handler fails all three replicas the same way on the same input, so failing over to the next replica just replays the same failure. Replication adds availability when the failures are uncorrelated — hardware faults, network partitions, single-machine memory pressure — and adds nothing when the failures are correlated, because three correlated answers are still one answer.

A single-author spec-code-tests triple is one replica wearing three hats. The hats look different — one is a markdown file, one is a function, one is an assertion — but they are all driven by the same brain in the same session. When the brain has a blind spot, all three artifacts have the blind spot. When the brain misunderstands a requirement, all three artifacts agree on the misunderstanding. The voter at the bottom of the pipeline — the CI runner — sees three votes that came from one mind and reports them as if they came from three.

The leadership question that follows from this is uncomfortable and worth asking out loud: does your AI engineering organization have any audit layer that is not itself authored by the same model family it is auditing? If the spec was written by the model, the code was written by the model, the tests were written by the model, and the PR review was done by the model — even with a human nominally signing the approve button — then "tests pass" carries roughly the same epistemic weight as "the author says this works." That used to be the baseline state of software before tests existed. Tests are how the field climbed out of it. Letting one author write all three artifacts is climbing back in, slowly, while everyone admires the velocity.

The fix is not more tests. The fix is different test authors — humans at the spec layer, a different model family at the test layer, or property-based assertions designed adversarially in a different session. None of these are expensive. All of them require admitting that the green build you got from a single-author triple is not the same green build you used to get when humans wrote the tests, and pricing the difference into how much you trust it.
