Skip to main content

The Test the Agent Wrote That Tests Nothing

· 10 min read
Tian Pan
Software Engineer

Ask a coding agent to "add tests for this module" and you will get tests. They will be neatly formatted, they will follow your project's conventions, and they will pass. Coverage will tick up. The PR will look like diligence. And a meaningful fraction of those tests will not be able to catch a single bug you might plausibly introduce.

This is not a story about a model being dumb. The agent did exactly what it was asked. The problem is that "add tests" and "add tests that constrain the behavior" are different requests, and only one of them is easy to verify at a glance. A green checkmark looks identical whether it sits on top of a real assertion or a tautology.

The result is a suite that grows in line count and shrinks in power. You end up with more files, more CI minutes, more things to maintain — and roughly the same probability of shipping a regression as before you started.

A passing test and a useful test are not the same thing

There is a quiet assumption baked into most engineering orgs: that a test which passes is a test which works. For human-written code that assumption is mostly harmless, because a human writing a test usually has a failure in mind. They picture the bug, then write the assertion that would catch it.

An agent generating tests does not start from a failure. It starts from the code. It reads the implementation, infers what the implementation currently does, and writes assertions that confirm the implementation currently does that. Every test passes on the first run — not because the code is correct, but because the test was derived from the code.

That distinction is the whole article. A test should be an independent statement of intent: given this input, the answer must be X. When the answer X is computed by reading the same code under test, the test has no independent opinion. It is a mirror. It will stay green through any refactor that preserves behavior and — critically — through any refactor that breaks behavior in a way the mirror also reflects.

The tell is subtle. A tautological test does not look broken. It looks thorough. It has a descriptive name, a clear arrange-act-assert shape, and a real assertion line. You have to ask a harder question to catch it: if the implementation were wrong, would this test still pass? For a large share of agent-generated tests, the answer is yes.

Why agents drift toward tests that assert nothing

None of this is the model being lazy. It is the model optimizing the signal it can see.

A coding agent's reward, broadly, is "the task is done and the suite is green." It does not receive a separate signal for "this test would catch a future regression," because that signal does not exist at generation time — nobody has introduced the future bug yet. So the agent optimizes the visible target: produce tests, make them pass, move on. This is Goodhart's Law with a test runner attached. The measure (tests pass, coverage rises) becomes the target, and once it is the target it stops measuring what you cared about.

Three concrete patterns fall out of that incentive.

Over-mocking. The fastest way to make a test pass deterministically is to mock every dependency so it returns exactly the shape the implementation expects. An empirical study of agent activity across real repositories found agents add mocks in 36% of test commits versus 26% for humans, and in 2025-era repositories agent-generated mocks made up a steadily rising share of all mocking. A mock that returns precisely what the code wants is not a test of integration — it is a restatement of the code's assumptions. If the real dependency drifts, the test never notices.

Happy-path bias. Agents prefer minimal scaffolding that matches the obvious pattern. Obvious patterns are happy paths. Invalid inputs, boundary values, partial failures, and concurrent states are where bugs actually live, and they are exactly the cases an agent skips unless explicitly told to cover them — because they require imagining a failure rather than describing the code.

Assertion laundering. The worst case: the test computes its expected value by calling the same helper, the same serializer, the same transform that the code under test uses. Both sides of the assertEqual run the identical logic. The test cannot fail unless the language itself breaks. It still counts toward coverage. It still shows up green.

Each of these is locally reasonable. Each produces a test that exists, passes, and raises a metric. Together they produce a suite that is large, green, and blind.

Coverage is the metric that lets this hide

If tautological tests were obviously worthless, nobody would merge them. They get merged because we have a number that says they are working: code coverage.

Coverage measures whether a line executed. It says nothing about whether a line was checked. A test that calls a function and discards the result, or asserts the result is "not null," will light up every line in that function as covered. The coverage tool reports 90% and the dashboard turns green. This is why a suite can show 100% line coverage and still, when probed, catch only a tiny fraction of injected bugs — every line ran, almost nothing was asserted.

Agents are very good at moving this particular number. Coverage is the most legible target in the testing world: it is reported per-PR, it has a threshold gate, it goes up and to the right. An agent asked to "improve test coverage" will improve test coverage. It will not improve your defect-detection rate, because that is not what the words mean and not what the tool measures.

So coverage does double damage. It gives the agent a target that is easy to satisfy without doing the real work, and it gives the reviewer a green number that suppresses the instinct to look closer. The metric that was supposed to be a floor becomes a blindfold.

Mutation testing: ask the suite to prove it can fail

If the failure mode is "tests that cannot fail," the fix is to make some code fail on purpose and see whether the suite reacts.

That is mutation testing. A mutation tool takes your source, makes a small semantically meaningful change — flips a > to a >=, swaps a + for a -, replaces a return value with a constant, deletes a conditional — and reruns the suite against each mutant. If a test fails, the mutant is "killed": some assertion genuinely depended on that behavior. If every test still passes, the mutant "survived": no test in your suite cares about that line, no matter what coverage claimed.

The mutation score — percentage of mutants killed — is the number coverage was pretending to be. Teams routinely discover a module sitting at 93% line coverage and a far lower mutation score, with whole categories of mutants strolling through untouched. A suite with 100% coverage and a 4% mutation score executes every line and would miss almost every bug. Mutation testing surfaces that gap because it tests the assertions, not the execution.

For agent-written tests this is the natural enforcement layer. You cannot eyeball a hundred generated tests for assertion quality. You can run mutation testing on the diff and require that newly added tests kill the mutants in the code they claim to cover. A test that kills zero mutants is precisely the test that tests nothing — and now you have a build step that says so. Tools like Stryker for JavaScript and TypeScript, or mutmut for Python, make this practical to wire into CI, even if you start by running it only on changed files to keep it fast.

Working with agents so the tests actually constrain

You do not have to stop letting agents write tests. You have to stop trusting "it passes" as the acceptance criterion. A few practices change the outcome.

Write the failing test first, then hand it over. The single most reliable guard is the red-green discipline: a test must be observed failing before the code that satisfies it exists. If you write the assertion yourself — or at minimum confirm a new test fails against unfixed code — you guarantee it exercises the behavior. An agent that writes test and implementation together never sees red, so it never proves the test can fail. Make "show me this test failing first" part of the loop.

Review the assertion, not the coverage delta. When an agent's PR adds tests, the diff line that matters is the assert. Read it. Ask whether the expected value was written down independently or computed by the code under test. Ask whether a wrong implementation would survive it. The coverage number at the bottom of the PR is the least informative thing in the review.

Treat surviving mutants as failing CI. Put mutation testing on the changed surface of every PR and gate on it. This converts "is this test any good" from a judgment call a tired reviewer makes at 6pm into a deterministic check. It also gives the agent a real signal — a target that actually correlates with catching bugs — instead of the proxy it has been gaming.

Constrain mocking explicitly. Since agents over-mock by default, say so in the configuration the agent reads. State that external boundaries may be mocked but logic under test may not, that a mock returning the exact expected shape is a smell, and that integration-level behavior should be tested against real or faked dependencies, not assumptions.

Keep adversarial cases on the human. Agents are genuinely good at filling out happy-path coverage once the hard cases are framed. Reserve your effort for the inputs that break things — boundaries, malformed data, partial failures, races — and let the agent handle the volume around them.

The suite is a hypothesis, not a trophy

A test suite is a claim: if this code regresses, something here will turn red. Every tautological test, every over-mocked test, every happy-path-only test weakens that claim while leaving the trophy — the green checkmark, the coverage badge — fully intact. That is what makes the failure mode dangerous. It does not look like absence. It looks like diligence.

Coding agents will keep generating tests faster than anyone can read them, and most of those tests will pass. The job has shifted. It is no longer "do we have tests" or even "what is our coverage." It is "can this suite still fail when the code is wrong" — and the only honest way to answer that is to break the code on purpose and watch.

Run the mutant. If the suite stays green, you did not write tests. You wrote a decoration.

References:Let's stay in touch and Follow me for more thoughts and updates