
The Library Version Your Coding Agent Remembers Wrong

· 10 min read
Tian Pan
Software Engineer

The diff looks clean. The agent imported the right module, called what looks like the right function, and TypeScript stayed quiet. The PR description even cites the docs. Then the build runs in CI and the call explodes with TypeError: x is not a function — because the function was split into two in a minor bump eight months ago, and the agent generated against the version of the library that existed inside its training data, not the version installed in your package.json.

This is not the kind of failure the "LLMs hallucinate" frame prepares you for. The model isn't inventing an API that never existed. It's remembering an API that existed once and doesn't anymore. The mental model the agent is reasoning from is a snapshot frozen at training time. The world has moved on. The codebase has moved on. And the agent has no idea, because nobody told it.

Recent measurements put hard numbers on the problem. A 2025 ICSE study found that across seven evaluated LLMs, deprecated API usage rates in generated code ranged from 25% to 38%. A separate benchmark on Python libraries showed model accuracy dropping from 56.1% on APIs that existed before the training cutoff to 32.5% on APIs introduced after — a 24-point cliff that the model itself cannot perceive. Training-cutoff drift is not a tail risk; it's a structural property of every coding agent in production.

Training-Cutoff Drift Is Not Generic Hallucination

The standard hallucination framing — the model is confidently wrong about something it never knew — misses what is happening here. The model knew. It studied your library. It can recite the function signature, the parameter order, the return shape, and even the older edge cases. What it does not know is which of those facts are still true.

Treating this as "the model made something up" leads teams to reach for the wrong tools. They tighten the system prompt. They add a verification step that asks the model to double-check its own output. Both interventions assume the agent has a way to recognize the gap between its memory and the world. It does not. From the model's perspective, every recalled API is equally vivid; the deprecated one and the current one occupy the same activation neighborhood.

Calling this "version drift" is more precise. The model is reasoning from a dated snapshot of the ecosystem, and every library it touches has its own drift rate. A sleepy library that has not shipped a breaking change in three years produces almost no drift. A framework mid-rewrite produces drift on every other generated line. The agent has no concept of which library it is in, and treats both with the same confidence.

This matters operationally because the mitigation strategies are different. Generic hallucination is reduced by grounding — give the model more relevant context and it makes fewer things up. Training-cutoff drift is reduced by recency grounding specifically: the model needs to be shown what changed since its cutoff, not just what exists today. Showing it the current docs without flagging the delta lets it pattern-match the parts that look familiar and silently override them with its priors.

The Failure Modes That Hide in Plain Sight

A few specific patterns produce drift that is hard to catch in review.

The compiling-but-broken call. TypeScript type definitions often carry historical shape long after a runtime rename. A function gets deprecated in a minor bump; the old name survives as an alias for two minor versions and is then removed; the type definition keeps declaring the old name throughout. The agent generates a call to the old name. The compiler is happy. The runtime is not, but only when that branch executes, and a green tsc run is the signal most teams rely on to trust the agent's output. This is the failure mode where the agent looks competent right up to the moment a test runs the path.
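
A minimal sketch of how a stale declaration can keep the compiler quiet while the runtime fails. The names parseConfig and parseConfigSync are hypothetical, and the cast stands in for an out-of-date type definition file; no real library is being described.

```typescript
// Hypothetical library surface as a stale type definition still describes it.
interface StaleTypes {
  parseConfig(text: string): string[]; // old name, already removed at runtime
  parseConfigSync(text: string): string[]; // current name
}

// What the installed version actually exports at runtime: only the new name.
const runtimeExports = {
  parseConfigSync: (text: string): string[] => text.split(","),
};

// The cast plays the role of an outdated .d.ts file: the compiler believes
// both functions exist because the declaration says so.
const lib = runtimeExports as unknown as StaleTypes;

// Type-checks cleanly, and fails only when the call actually executes.
function callOldName(): string {
  try {
    lib.parseConfig("a,b");
    return "ok";
  } catch (err) {
    return err instanceof TypeError ? "TypeError at runtime" : "other error";
  }
}

// A cheap guard: compare the names the code calls with what is exported.
function missingAtRuntime(called: string[], exported: object): string[] {
  return called.filter((name) => !(name in exported));
}
```

Here callOldName() compiles without complaint and still hits a TypeError, while missingAtRuntime(["parseConfig", "parseConfigSync"], runtimeExports) reports the old name without running the risky path.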

The package.json lookalike. Coding agents that read package.json to ground themselves can do worse than agents that ignore it. The agent sees "react": "^18.2.0", recognizes the version as one it has seen many examples of, and generates code that matches a slightly different point release inside that range. A caret range accepts any 18.x release at or above 18.2.0, so the range alone does not say which release is installed; the lockfile does. The agent is not even hallucinating the version: it correctly recognizes the major and over-confidently fills in the minor. The version constraint becomes a permission to generate from memory rather than a prompt to verify.
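
To make the ambiguity concrete, here is a small sketch of caret matching for majors of 1 and above. The helper names are made up for illustration, and this is a simplification rather than a replacement for a real semver library: it ignores pre-release tags and 0.x ranges.

```typescript
type Version = [number, number, number];

// Parse "18.3.1" into [18, 3, 1]. Pre-release tags are not handled here.
function parseVersion(text: string): Version {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (!m) throw new Error("unsupported version: " + text);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Caret semantics for majors >= 1: same major, and not lower than the base.
function satisfiesCaret(range: string, installed: string): boolean {
  if (range.charAt(0) !== "^") throw new Error("only caret ranges are handled");
  const base = parseVersion(range.slice(1));
  const v = parseVersion(installed);
  if (base[0] === 0) throw new Error("0.x caret ranges follow different rules");
  if (v[0] !== base[0]) return false;
  if (v[1] !== base[1]) return v[1] > base[1];
  return v[2] >= base[2];
}
```

Both satisfiesCaret("^18.2.0", "18.2.0") and satisfiesCaret("^18.2.0", "18.3.1") hold, which is the point: the declared range is a set of candidates, and only the lockfile or node_modules says which one the code will run against.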

The transitive shape carrier. A library you import directly re-exports a type from an upstream library you do not import yourself. The shape of that type changed upstream. Your direct dependency still exposes the old shape because it has not bumped its peer dependency. The agent generates against the canonical upstream documentation it learned from. The transitive shape disagrees. The error message points at neither library, and the team spends an afternoon finding a bug that was never theirs to write.

The "removed" config key. Configuration keys are particularly drift-prone because they live in YAML or JSON that the type system does not police. A key gets renamed in a minor bump; the feature now reads only the new key; the old key is silently ignored. The agent emits the old key. The application starts up. The feature the key was supposed to enable does nothing. Production reads "the deployment worked" because nothing crashed, and the regression is detected only when a customer asks why something stopped working.
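
One way to surface silently ignored keys is to compare a parsed config against the keys the installed version is known to read. The sketch below assumes such a key list and a rename map are available from somewhere (both are hypothetical inputs, as are the example key names).

```typescript
interface KeyReport {
  unknown: string[]; // keys the installed version does not read at all
  renamed: Record<string, string>; // old key -> current key
}

// Compare the keys present in a parsed config with the keys the installed
// version reads, and with a map of known renames.
function checkConfigKeys(
  config: Record<string, unknown>,
  knownKeys: string[],
  renames: Record<string, string>
): KeyReport {
  const report: KeyReport = { unknown: [], renamed: {} };
  for (const key of Object.keys(config)) {
    if (knownKeys.indexOf(key) !== -1) continue; // still a valid key
    if (Object.prototype.hasOwnProperty.call(renames, key)) {
      report.renamed[key] = renames[key]; // stale name with a known successor
    } else {
      report.unknown.push(key); // typo, or a key removed without replacement
    }
  }
  return report;
}

// Hypothetical example: cacheTtl was renamed to cacheTtlSeconds.
const keyReport = checkConfigKeys(
  { cacheTtl: 60, retries: 3, verbos: true },
  ["cacheTtlSeconds", "retries"],
  { cacheTtl: "cacheTtlSeconds" }
);
```

Failing startup or CI when the report is non-empty turns "the feature does nothing" into an error someone sees before a customer does.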

The shared shape of all four is that the failure surface points away from the agent. The crash is in user code, the type checker passed, the lint passed, and the docs the reviewer skimmed, reached through an old link, described the deprecated behavior. The agent is the last suspect, because the agent looks right.

What Closing the Gap Actually Looks Like

The pattern that holds up under load is to treat the agent's belief about the installed version of every library it touches as untrusted, and to re-establish that version at the start of every relevant turn. Concretely, four interventions stack.

Version-prelude before generation. Before the agent writes a call into a file that imports library X, it must explicitly state what version of X is installed and what differences from its training-time knowledge it expects to encounter. This is not a hand-wave step. The model has to commit to specifics ("react 19.0.2 is installed; the major bump removed ReactDOM.render and string refs and changed Suspense boundary semantics") because committing to specifics forces it to recognize when it does not actually know. The cost is one extra turn per library, paid once per workspace session.
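
A minimal sketch of how the prelude might be assembled, assuming the harness has already read the declared range from package.json and the installed version from the lockfile. The interface, function name, and prompt wording are illustrative, not a prescribed format.

```typescript
interface LibraryFacts {
  name: string;
  declaredRange: string; // as written in package.json
  installedVersion: string; // as resolved in the lockfile or node_modules
}

// Build the text that asks the model to commit to specifics before it writes code.
function buildVersionPrelude(libs: LibraryFacts[], trainingCutoff: string): string {
  const lines: string[] = [
    "Before writing code, state for each library below what you expect to differ",
    "from your training-time knowledge (cutoff " + trainingCutoff + ").",
    "Say so explicitly when you do not know.",
  ];
  for (const lib of libs) {
    lines.push(
      "- " + lib.name + " " + lib.installedVersion +
        " is installed (declared " + lib.declaredRange + ")"
    );
  }
  return lines.join("\n");
}

// Hypothetical cutoff date, used only to illustrate the output.
const prelude = buildVersionPrelude(
  [{ name: "react", declaredRange: "^19.0.0", installedVersion: "19.0.2" }],
  "2024-10-01"
);
```

The installed version comes from the workspace, never from the model, so the only thing the model contributes is its own account of what it thinks changed.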

Docs retrieval scoped to the actual installed version. This is a retrieval step that fetches the current API reference for the specific symbol the agent is about to use, version-pinned to what is installed. The 2025 RAG benchmarks show this lifts post-cutoff accuracy by about 13.5 percentage points: meaningful, but not a fix. It earns a place as a standard turn in any agent that writes against third-party libraries; it is not enough to be the only intervention.
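
The pinning itself can be sketched as an exact-match lookup that refuses to fall back to a neighboring version. The DocEntry shape, the in-memory store, and the examplelib entries are assumptions made for illustration; a real system would sit in front of a search index.

```typescript
interface DocEntry {
  library: string;
  version: string; // the exact version these docs describe
  symbol: string;
  text: string;
}

// Return docs only for the exact installed version. Returning null instead of
// the "closest" version keeps a miss visible rather than silently wrong.
function lookupPinnedDoc(
  store: DocEntry[],
  library: string,
  installedVersion: string,
  symbol: string
): DocEntry | null {
  for (const entry of store) {
    if (
      entry.library === library &&
      entry.version === installedVersion &&
      entry.symbol === symbol
    ) {
      return entry;
    }
  }
  return null;
}

// Hypothetical store with docs for two versions of the same symbol.
const docStore: DocEntry[] = [
  { library: "examplelib", version: "2.4.0", symbol: "connect", text: "connect(url)" },
  { library: "examplelib", version: "3.0.0", symbol: "connect", text: "connect(options)" },
];
```

A null result is itself useful signal: it tells the agent that it has no version-accurate reference for the symbol and should say so.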

Changelog fingerprinting. The most expensive part of the docs is rarely the part that matters. What matters is the delta: what changed since the model's cutoff. A pre-built index that, given a library and a cutoff date, returns the list of renamed, removed, and signature-changed symbols can be loaded into the agent's context whenever those libraries appear in the workspace. That gives the model the right anchor: "these specific things are different from what you remember." This is far more token-efficient than the full docs and far more behaviorally effective, because it addresses the failure mode directly rather than hoping broad coverage happens to include it.
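
A sketch of what such an index and its query could look like. The entry shape, the examplelib and otherlib entries, and the dates are all invented for illustration; how the index is built from real changelogs is out of scope here.

```typescript
type ChangeKind = "renamed" | "removed" | "signature-changed";

interface ChangeEntry {
  library: string;
  symbol: string;
  kind: ChangeKind;
  releasedOn: string; // ISO date, e.g. "2025-03-14"
  note: string;
}

// Return only the changes that shipped after the cutoff for libraries in use.
// ISO dates in YYYY-MM-DD form compare correctly as plain strings.
function changesSinceCutoff(
  index: ChangeEntry[],
  libraries: string[],
  cutoff: string
): ChangeEntry[] {
  return index.filter(
    (e) => libraries.indexOf(e.library) !== -1 && e.releasedOn > cutoff
  );
}

// Render the delta as compact lines suitable for the agent's context.
function renderDelta(entries: ChangeEntry[]): string {
  return entries
    .map((e) => e.library + ": " + e.symbol + " " + e.kind + " (" + e.note + ")")
    .join("\n");
}

// Hypothetical index contents.
const changeIndex: ChangeEntry[] = [
  { library: "examplelib", symbol: "connect", kind: "signature-changed", releasedOn: "2025-03-14", note: "takes an options object" },
  { library: "examplelib", symbol: "open", kind: "removed", releasedOn: "2023-06-01", note: "use connect" },
  { library: "otherlib", symbol: "load", kind: "renamed", releasedOn: "2025-05-02", note: "now loadAsync" },
];
const delta = renderDelta(changesSinceCutoff(changeIndex, ["examplelib"], "2024-10-01"));
```

Only the post-cutoff change for a library in the workspace survives the filter, which is what keeps the context cost small.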

Post-generation type-definition cross-check. After the agent has written code, a static pass walks the AST, identifies every call into an imported third-party symbol, and checks the call shape against what is actually installed (the shipped type definitions, their @deprecated annotations, and the package's runtime exports), not the agent's memory of them. This catches the compiling-but-broken pattern. It is cheap, deterministic, and does not require the model at all.
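
As a deliberately simplified stand-in for the AST walk, the sketch below scans named imports with a regular expression and compares them with a description of what each installed package exports. The InstalledSurface shape and the examplelib data are assumptions; a production version would use a real parser and read the export list from the installed package.

```typescript
interface InstalledSurface {
  exports: string[]; // names the installed package actually exports
  deprecated: string[]; // exported names marked as deprecated
}

interface ImportFinding {
  module: string;
  symbol: string;
  problem: "missing" | "deprecated";
}

// Scan `import { a, b as c } from "pkg"` statements and flag names that the
// installed package does not export or marks as deprecated.
// Limitation: default imports, namespace imports, and `type` modifiers are ignored.
function checkNamedImports(
  source: string,
  installed: Record<string, InstalledSurface>
): ImportFinding[] {
  const findings: ImportFinding[] = [];
  const importPattern = /import\s*\{([^}]*)\}\s*from\s*["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    const moduleName = match[2];
    if (!Object.prototype.hasOwnProperty.call(installed, moduleName)) continue;
    const surface = installed[moduleName];
    for (const raw of match[1].split(",")) {
      // "foo as bar" imports the exported name "foo".
      const symbol = raw.trim().split(/\s+as\s+/)[0];
      if (symbol === "") continue;
      if (surface.exports.indexOf(symbol) === -1) {
        findings.push({ module: moduleName, symbol: symbol, problem: "missing" });
      } else if (surface.deprecated.indexOf(symbol) !== -1) {
        findings.push({ module: moduleName, symbol: symbol, problem: "deprecated" });
      }
    }
  }
  return findings;
}

// Hypothetical generated code and installed surface.
const generated = 'import { render, createRoot as cr, oldHelper } from "examplelib";';
const importFindings = checkNamedImports(generated, {
  examplelib: { exports: ["createRoot", "oldHelper"], deprecated: ["oldHelper"] },
});
```

The deprecated finding matters as much as the missing one: a deprecated symbol still compiles and still runs today, so nothing else in the pipeline will flag it.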

These four are individually imperfect and together close most of the drift surface. The team that ships only one of them — typically docs retrieval, because it sounds like the answer — gets a quality bump and continues to ship deprecated-API regressions at a rate they can no longer explain.

Building Evals That Measure Drift, Not Just Accuracy

Most coding-agent evals select problems from snapshots that predate the model's cutoff. This is not a methodological flaw the way overfitting is — it is a measurement gap that flatters every model and gives teams no signal about the failure mode they will actually hit in production.

A drift-aware eval requires deliberate construction. Take a set of libraries with known breaking changes shipped after the model's training cutoff. Build scenarios where the correct generation requires recognizing that the API has changed. Grade not just whether the generation is correct, but whether the agent detected the skew — did it ask, did it fetch docs, did it flag uncertainty — versus confidently generating against the old shape.
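
The two-axis grading can be sketched as follows. The trial fields and the four outcome labels are invented names for illustration; how a harness decides that the agent "flagged uncertainty" is left open.

```typescript
interface DriftTrial {
  correct: boolean; // did the generated code work against the installed version
  fetchedDocs: boolean; // did the agent retrieve docs before generating
  askedOrFlagged: boolean; // did it ask a question or flag uncertainty
}

type DriftOutcome =
  | "verified-correct"
  | "lucky-correct"
  | "detected-but-wrong"
  | "confident-wrong";

// Grade on two axes: was the output right, and did the agent notice the skew.
function gradeTrial(t: DriftTrial): DriftOutcome {
  const detected = t.fetchedDocs || t.askedOrFlagged;
  if (t.correct) return detected ? "verified-correct" : "lucky-correct";
  return detected ? "detected-but-wrong" : "confident-wrong";
}

// Count outcomes across a run of the eval.
function summarize(trials: DriftTrial[]): Record<DriftOutcome, number> {
  const counts: Record<DriftOutcome, number> = {
    "verified-correct": 0,
    "lucky-correct": 0,
    "detected-but-wrong": 0,
    "confident-wrong": 0,
  };
  for (const t of trials) counts[gradeTrial(t)] += 1;
  return counts;
}
```

Separating "lucky-correct" from "verified-correct" is the design choice that matters: a plain accuracy score would merge them, and the lucky ones are the generations most likely to break on the next release.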

This is harder to assemble than a generic code-generation eval because the eval ages out as the model rolls forward, and because the "right" behavior is partly a judgment call (when should the agent ask versus when should it just look it up). But it is the only eval that measures the thing your production agent fails at. A team that does not have a drift eval is choosing to discover drift through user incidents.

A related signal worth instrumenting is the agent's abstention rate on cutoff-adjacent symbols. An agent that confidently generates against every symbol it sees, regardless of library recency, is miscalibrated even when the generations happen to be correct. The right behavior on a library that bumped a major version after the cutoff is to slow down, fetch docs, and verify, not to generate fluently. Measuring abstention rate gives the team a leading indicator that holds up across model versions.
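
A minimal sketch of the metric, assuming each logged event records when the symbol's library last shipped a breaking change and whether the agent paused to verify. Treating "cutoff-adjacent" as "changed after the cutoff" is a simplification chosen for this example, and the field names are hypothetical.

```typescript
interface SymbolEvent {
  symbol: string;
  lastChangedOn: string; // ISO date of the library's latest breaking change
  abstained: boolean; // the agent paused to fetch docs or verify
}

// Share of post-cutoff symbols on which the agent verified before generating.
// Returns null when no event qualifies, so "no data" is not read as 0%.
function abstentionRate(events: SymbolEvent[], cutoff: string): number | null {
  const adjacent = events.filter((e) => e.lastChangedOn > cutoff);
  if (adjacent.length === 0) return null;
  const abstained = adjacent.filter((e) => e.abstained).length;
  return abstained / adjacent.length;
}
```

Tracking this number per model version shows whether a new model is more careful around recent libraries or just more fluent.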

The Architectural Realization

The deeper point is that an LLM's knowledge of every library it has ever seen is dated as of a fixed moment, and that moment is not the same as the moment the agent is generating in. Every coding agent is a time traveler from its training cutoff, and the gap between then and now is filled with breaking changes the agent has no way to discover from its weights alone.

Teams that internalize this stop trying to make the model "smarter about versions" and start treating recency as an explicit context input the way they would treat any other piece of state the model does not have. Installed versions go in the system context. Changelogs since the cutoff go in the system context. The boundary of "what the model is allowed to assume" gets drawn at the cutoff date and the team's tooling fills in the delta.

The teams that do not do this ship an agent that keeps writing yesterday's code into tomorrow's branches — confidently, fluently, and incorrectly — and discover the gap one production incident at a time. The fix is not a better model. The fix is giving the model a way to learn what changed since it last looked.
