Your Coding Agent Is a Junior Engineer Who Never Reads the Tests

10 min read
Tian Pan
Software Engineer

The benchmark numbers tell a strange story. On SWE-bench Verified, multiple agent products running the same underlying model — Auggie, Cursor, Claude Code, all on Opus 4.5 — produced wildly different results. Auggie solved 17 more of the 731 problems than its closest peer despite the identical brain. The gap was scaffolding: how the agent was prompted, what context it was given, which tools it could call, and what the harness did when it got confused. The model is a commodity. The scaffolding around it is the product.

This is the same realization mature engineering teams reached about junior engineers a decade ago. A bright graduate doesn't ship value because the model is good. They ship value because the README is current, the test suite is fast, the code review rubric catches the same six mistakes every time, and someone wrote a CONTRIBUTING.md that names the constraints. Strip that scaffolding away and the same person produces locally coherent, globally wrong code that breaks production invariants the team didn't know to write down.

The teams squeezing real productivity out of coding agents have noticed this and are doing for the agent exactly what they should have already been doing for the junior hire. The teams that aren't are blaming the model.

The Onboarding Parallel That Almost Holds

The analogy of "agent as junior engineer" is useful but worth taking seriously rather than waving at. The places it holds and the places it breaks are both instructive.

A junior engineer asks questions when stuck. An agent pattern-matches confidently from training data and writes a function that compiles, runs, and silently violates a rate limit it had no way to know about. There's a documented case from March 2026 of an AI-generated Slack cleanup job that hammered Slack's global one-request-per-second API limit because the rate constraint wasn't visible in the local scope of the method being edited. The code was professionally written. The invariant lived somewhere else — in a paragraph of a vendor doc the agent never read.
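
To make the fix concrete, here is a minimal sketch of the kind of wrapper that would have made the invariant local to the code. The 1000 ms interval mirrors the one-request-per-second limit as described above, and the `deleteMessage` helper is hypothetical; this is an illustration, not Slack's SDK.

```typescript
// Minimal sketch: funnel every outbound Slack call through one limiter
// so the documented limit lives next to the code that must respect it.
// The interval follows the incident description; check Slack's docs
// for the real per-method rate tiers.

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class RateLimiter {
  private chain: Promise<void> = Promise.resolve();

  constructor(private readonly minIntervalMs: number) {}

  // Run tasks one at a time, spacing them at least minIntervalMs apart.
  schedule<T>(task: () => Promise<T>): Promise<T> {
    const result = this.chain.then(task);
    // Keep the spacing whether the task succeeds or fails.
    this.chain = result.then(
      () => sleep(this.minIntervalMs),
      () => sleep(this.minIntervalMs),
    );
    return result;
  }
}

const slackLimiter = new RateLimiter(1000);

// Hypothetical helper for the cleanup job; chat.delete is a plausible
// endpoint for a job that deletes old messages.
async function deleteMessage(channel: string, ts: string): Promise<Response> {
  return slackLimiter.schedule(() =>
    fetch("https://slack.com/api/chat.delete", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.SLACK_TOKEN}`,
      },
      body: JSON.stringify({ channel, ts }),
    }),
  );
}
```

Routing every outbound call through one module is the real point; the limiter is just where the documented constraint finally lives next to the code that has to honor it.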

A junior engineer reads the test suite to figure out the contract. An agent ignores the tests unless explicitly pointed at them. The default agent behavior is to read the file you opened, scan a few neighbors, and start typing. Tests are documentation that executes; an agent that doesn't read them is an engineer who hasn't realized the codebase has a spec.

A junior engineer accumulates implicit knowledge over months. By month six, they know which directory is haunted, which service can't be restarted on Fridays, and that the three-letter abbreviation everyone uses doesn't match what's in the code. An agent starts every session at zero. Whatever isn't in the prompt, the loaded files, or the discoverable structure of the repo doesn't exist for it. That's not a bug to fix; it's a constraint to design around.

Once you accept the constraint, the engineering question becomes: what would I write down for a junior who started fresh every Monday morning?

The Scaffolding That's Doing the Work

Look at the published best practices from teams shipping serious agent-driven workflows, and the same five artifacts keep appearing.

A CLAUDE.md or AGENTS.md at the repo root. Not a marketing README — a tight document that names the architecture in three sentences, points at the entry points, lists the package managers and build commands, and flags the files an agent must never edit (generated code, lockfiles, vendored dependencies). The HumanLayer team's guide is blunt: this is the highest-leverage file in your repo for AI-assisted development, and every token in it is competing for the agent's attention. Keep it lean. Anything that isn't universal to the project belongs in a sub-directory file or a skill, not at the root.
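
For shape rather than content, here is a sketch of what the top of such a file might look like. Every project name, path, and command below is a stand-in for your own:

```markdown
<!-- Illustrative sketch; every name below is a stand-in. -->
# CLAUDE.md

Payments API: a TypeScript service. Routes live in src/routes/, business
logic in src/services/, persistence in src/db/. Entry point: src/server.ts.

- Install: pnpm install · Build: pnpm build · Test: pnpm test
- Never edit: src/generated/**, pnpm-lock.yaml, vendor/**
- Invariants the code enforces but doesn't document: docs/architecture.md
- Directory-specific guidance lives in the AGENTS.md inside each service
```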

Tests that double as executable documentation. Not because tests catch agent bugs — they do, but that's a secondary benefit — but because tests are the only artifact that names a contract precisely enough that an agent can satisfy it without ambiguity. A failing test is a prompt the agent can act on without misinterpretation. A README sentence saying "the function should handle empty input gracefully" leaves room for six different implementations, three of which break a caller you forgot existed.
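
The contrast in miniature: `parseTags` is a hypothetical helper, standing in for any function a README might describe as "handling empty input gracefully." Each test pins down exactly one reading of that sentence:

```typescript
import { describe, expect, it } from "vitest";
// parseTags is a hypothetical helper used to make the point.
import { parseTags } from "../src/tags";

describe("parseTags", () => {
  // Pins one specific reading: empty in, empty out, no throw.
  it("returns an empty array for empty input", () => {
    expect(parseTags("")).toEqual([]);
  });

  // Rules out the implementation that keeps blank segments.
  it("drops blank segments between commas", () => {
    expect(parseTags("a,,b")).toEqual(["a", "b"]);
  });
});
```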

A code-review rubric that catches the same mistakes the agent reliably makes. Across teams, the failure modes cluster: missing auth checks, error handling that swallows the actual error, retries without backoff, calls to a deprecated internal API the agent saw in old training data, plausible-looking but globally wrong concurrency code. If you've watched an agent for a week and seen the same six patterns, write them down as automated checks or a review checklist. This is the agent equivalent of the "six things every junior misses on their first PR" list every team has and pretends not to need.
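
One of those checklist items, encoded mechanically: a sketch of a CI gate, run against the staged diff, that flags newly introduced empty catch blocks. A real version would grow one pattern per repeat offense.

```typescript
// Sketch: fail the check if the staged diff adds an empty catch block
// (error handling that swallows the actual error). Regex-based and
// deliberately crude; the point is that the rubric item is mechanical.
import { execSync } from "node:child_process";

const diff = execSync("git diff --cached --unified=0", { encoding: "utf8" });

const addedLines = diff
  .split("\n")
  .filter((line) => line.startsWith("+") && !line.startsWith("+++"));

const offenders = addedLines.filter((line) =>
  /catch\s*(\([^)]*\))?\s*\{\s*\}/.test(line),
);

if (offenders.length > 0) {
  console.error("This change introduces empty catch blocks:");
  for (const line of offenders) console.error(`  ${line.slice(1).trim()}`);
  process.exit(1);
}
```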

Architecture documentation that names the invariants the codebase enforces but doesn't document. Industry research on tribal knowledge consistently finds that tacit understanding — not technical setup — is the dominant productivity blocker for new hires. The same tribal knowledge gap applies to agents, except worse, because the agent can't ask the senior at the next desk. If your service must never call the database from inside the request handler, write that down. If the auth context has to be passed explicitly because there's no thread-local, write that down. The discipline isn't writing prose for its own sake; it's converting tribal knowledge into something the agent can load into a prompt.

Fast feedback loops. Slow tests, slow compiles, and slow tool responses don't just frustrate humans; they break the agent's loop entirely. An agent running test-edit-test-edit on a five-minute test suite will time out, lose context, or give up and start guessing. The teams getting real productivity from agents have invested in test parallelization, hot-reload, and quick local validation paths because they had to — the agent's loop is more sensitive to feedback latency than a human's.

Test-First Prompting, Scope-Limited Tasks

There are two prompting practices that close most of the gap, and both come straight from how you'd brief a junior engineer you didn't entirely trust yet.

First: test-first. Hand the agent a failing test and tell it to make the test pass without modifying the test file. This constrains the agent's solution space to exactly the contract you care about. It eliminates the "write a function that pattern-matches what success usually looks like" failure mode, because success is now defined by a runnable artifact, not a prose description. It also produces an audit trail — the test that was failing, the test that's passing, the diff that bridged them. A code review against that diff is a thousand times faster than reviewing a feature implemented from a paragraph of intent.

Second: scope-limiting. The Red Hat Developer guidance from earlier this year makes the case clearly — narrow-scope prompts produce dramatically more reliable behavior than broad ones, and most production agent prompts cluster between 800 and 2000 tokens for a reason. A prompt that reads like a junior-engineer task description ("change the formatTimestamp function in lib/dates.ts to return UTC instead of local time, and update the three callers in services/billing/") is far more reliable than a senior-engineer instinct ("clean up the timezone handling"). The first leaves no room for the agent to invent scope; the second invites it to refactor things you didn't ask about.
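
For illustration, here is the entire deliverable the narrow prompt asks for: a hypothetical `formatTimestamp`, reconstructed from the task description above, pinned to UTC.

```typescript
// lib/dates.ts, hypothetical: the scope of the prompt above is this one
// function plus its three callers. Before, it formatted in the server's
// local timezone; after, it is pinned to UTC. Nothing else changes.
export function formatTimestamp(epochMs: number): string {
  const d = new Date(epochMs);
  const pad = (n: number) => String(n).padStart(2, "0");
  return (
    `${d.getUTCFullYear()}-${pad(d.getUTCMonth() + 1)}-${pad(d.getUTCDate())}` +
    ` ${pad(d.getUTCHours())}:${pad(d.getUTCMinutes())}:${pad(d.getUTCSeconds())} UTC`
  );
}
```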

The inversion most teams need is uncomfortable. Senior engineers prompt the way they'd think — broadly, instinctively, leaving the boring decisions for later. That's the wrong register for an agent. The right register is the brief you'd give a new contractor on day one: explicit, scoped, with the edges named.

The Failure Mode That Ships Anyway

There's a class of agent failure that no amount of scaffolding entirely prevents, and recognizing it is the difference between a team that ships AI-assisted code safely and a team that doesn't.

The Slack rate-limiting incident from March is the canonical example. The code was correct in every local sense. It compiled, passed lint, handled the obvious edge cases. It just made a system-level assumption that didn't hold — the assumption that the API it was calling could be hit in a tight loop. The invariant wasn't anywhere the agent could see it. It existed only in vendor documentation, in the head of the engineer who'd integrated Slack two years prior, and in the rate-limit dashboard nobody had wired into the dev environment.

This is the locally-coherent-globally-wrong failure mode, and it's where agents reliably exceed the junior-engineer analogy in damage potential. A junior would have asked. The agent shipped a PR, got it approved by a reviewer who skimmed instead of scrutinizing, and merged. Production paged.

The defense isn't more prompting; it's review gates that assume the local code is fine and look elsewhere. Does this change cross a service boundary you've drawn before? Does it call an external API in a context that's never called external APIs? Does it touch a path that an existing alarm watches? Each of those questions is a check that's mechanical to encode and that catches the failure mode the agent will keep producing no matter how good the model gets, because the failure mode lives in the gap between the file the agent edited and the system the file is part of.
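
Here is one of those gates sketched as a pre-merge script. The directory list and the egress heuristic are placeholders for whatever boundaries your team has actually drawn.

```typescript
// Sketch of one review gate: fail if the staged diff introduces an
// outbound HTTP call inside a directory that never makes them. The
// NO_EGRESS_DIRS list and the call pattern are illustrative.
import { execSync } from "node:child_process";

const NO_EGRESS_DIRS = ["src/handlers/", "src/db/"];
const EGRESS_PATTERN = /\b(fetch|axios|http\.request)\s*\(/;

const changedFiles = execSync("git diff --cached --name-only", {
  encoding: "utf8",
})
  .split("\n")
  .filter(Boolean);

for (const file of changedFiles) {
  if (!NO_EGRESS_DIRS.some((dir) => file.startsWith(dir))) continue;
  const fileDiff = execSync(`git diff --cached --unified=0 -- "${file}"`, {
    encoding: "utf8",
  });
  const added = fileDiff
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++"));
  if (added.some((l) => EGRESS_PATTERN.test(l))) {
    console.error(`External call added inside a no-egress boundary: ${file}`);
    process.exit(1);
  }
}
```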

The Documents That End Up Mattering

The leadership realization that arrives after a team has run this for a quarter is the one worth landing on. The agent-onboarding documents — the CLAUDE.md, the architecture-invariants doc, the test-suite-as-spec, the review rubric — turn out to be exactly the documents the next human hire needs too. The team that was systematizing "what would I tell a junior" for the agent's benefit ends up with onboarding artifacts that the next senior hire reads on day three, cutting a month off their ramp time.

This isn't a coincidence. The forcing function the agent provides is that the codebase's tribal knowledge has to become explicit because the agent can't infer it. That's a forcing function teams should have applied to themselves years ago and didn't, because the cost of tribal knowledge — the studies put it at roughly 23 hours per developer per month in knowledge-seeking time — was paid invisibly, in the lost productivity of every new hire's first six weeks. The agent makes the cost legible because the agent fails loudly when the documentation is missing.

The two outcomes are linked. A team that systematizes its agent scaffolding gets a more effective agent and an onboarding pipeline that finally works for humans. A team that treats the agent as a model problem — chasing benchmarks, swapping vendors when results disappoint, adjusting temperature — gets neither. The model is a commodity now. The scaffolding is the product. And the scaffolding looks remarkably like the engineering hygiene every team already knew it needed, written down at last because something with no patience finally demanded it.
