
Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.

The fidelity gap no one talks about

Mocking an LLM call is trivially easy and deceptively reassuring. You have a chat.completions.create somewhere, you replace it with a MagicMock that returns {"choices": [{"message": {"content": "..."}}]}, and the test passes. The problem is that your mock is a snapshot of a specific model's output at a specific point in time. The actual contract between your code and the model is much richer than that snapshot: tokenization behavior, tool-call argument shapes, refusal patterns, whitespace quirks in structured outputs, the way temperature interacts with your prompt, and the non-obvious cases where a model suddenly decides your system prompt means something different than you intended.
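In miniature, the anti-pattern looks like this — a sketch with a hypothetical summarize() helper and an OpenAI-style client mocked with unittest.mock (all names here are illustrative, not from any real codebase):

```python
from unittest.mock import MagicMock

# Hypothetical helper under test: extracts the message content from an
# OpenAI-style response object.
def summarize(client, text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

# The anti-pattern: a mock frozen to last quarter's output string.
client = MagicMock()
client.chat.completions.create.return_value.choices = [
    MagicMock(message=MagicMock(content="The report covers Q3 revenue."))
]

# This assertion tests our memory of the model, not the model: it can
# only fail when someone edits the test.
assert summarize(client, "quarterly report text") == "The report covers Q3 revenue."
```

Nothing in that test couples to what any real model returns today; the provider can change everything and the suite stays green.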

When a team writes assertions against the mocked string, they are not testing their application against the model. They are testing their application against their memory of the model. That memory rots. Providers push silent updates. The mock keeps passing. A report from one team tracking this behavior noted that bugs reached production behind LLM-generated tests whose assertions were tied to the current implementation rather than the contract — exact return values and mocked internals instead of behavioral properties. The test suite went green after a refactor while the path that actually ran in production was broken.

The tell is simple: if your LLM mock returns a hand-written string that matches no real model output anywhere in your system, you have written a test that can only fail when you change the test. It has no coupling to reality.

The layered fixture architecture

The way out is to treat LLM test fixtures the way database testing treats data fixtures: as a deliberate hierarchy with different cost/fidelity tradeoffs at each tier. The three tiers that pay off in practice:

Tier 1 — Stub fakes. A FakeListLLM-style object that returns canned responses by index. LangChain ships one, and every serious LLM framework has an equivalent. These exist to let you test your application's plumbing: control flow, error handling, retry logic, the state machine around tool calls. They are not testing the model; they are testing everything around the model. Use them liberally. They run in milliseconds. Write a lot of them.
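A minimal stand-in for the pattern — LangChain's real FakeListLLM behaves the same way; the generate_with_retry wrapper here is a hypothetical piece of plumbing under test:

```python
# A FakeListLLM-style stub: returns canned responses in order, so a test
# can script a failure-then-success sequence and exercise the retry loop
# around the model rather than the model itself.
class FakeListLLM:
    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = 0

    def generate(self, prompt):
        response = self.responses[self.calls % len(self.responses)]
        self.calls += 1
        if isinstance(response, Exception):
            raise response  # scripted failure
        return response

# Hypothetical retry wrapper under test -- the plumbing, not the model.
def generate_with_retry(llm, prompt, attempts=3):
    for attempt in range(attempts):
        try:
            return llm.generate(prompt)
        except TimeoutError:
            if attempt == attempts - 1:
                raise

# One timeout, then a well-formed response: the retry path is fully covered.
llm = FakeListLLM([TimeoutError("simulated"), '{"status": "ok"}'])
assert generate_with_retry(llm, "ping") == '{"status": "ok"}'
assert llm.calls == 2  # one failure, one success
```

Note what this test claims and what it does not: the retry logic is correct, and nothing more.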

Tier 2 — Recorded cassettes. Use pytest-recording (VCR.py) or an equivalent to record real HTTP traffic to the provider once, then replay the recording on every subsequent test run. The first run is real and slow; every run after is deterministic and fast. Crucially, the cassette contains the actual shape of the provider's response — headers, streaming deltas, tool-call JSON schemas, the subtle encoding of refusals. When the provider changes that shape, you can re-record and diff the cassette to see what moved. This is how you pin behavior against a specific model version without paying for every test run. Projects like vcr-langchain and baml_vcr exist specifically for this pattern.
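The real tools intercept at the HTTP layer, but the lifecycle they automate can be sketched in a few lines of stdlib Python — the cassette path and response shape below are purely illustrative:

```python
import json
import os
import tempfile

# Minimal illustration of the cassette lifecycle that VCR.py /
# pytest-recording automate at the HTTP layer: the first call hits the
# real provider and writes the full response shape to disk; every later
# call replays the file.
def call_with_cassette(cassette_path, live_call):
    if os.path.exists(cassette_path):
        with open(cassette_path) as f:
            return json.load(f)           # replay: deterministic, offline
    response = live_call()                # record: one real, slow request
    with open(cassette_path, "w") as f:
        json.dump(response, f, indent=2)  # the diff-able artifact you re-record
    return response

# Stand-in for a real provider call (hypothetical response shape).
def fake_live_call():
    return {"choices": [{"message": {"content": '{"ok": true}', "tool_calls": None}}]}

path = os.path.join(tempfile.mkdtemp(), "summarize.json")
first = call_with_cassette(path, fake_live_call)  # records
second = call_with_cassette(path, lambda: 1 / 0)  # replays; live call never runs
assert first == second
```

The second call would crash if it ever went live — proof that replay is total. The real libraries add the parts that matter in practice: header filtering so API keys never land in the cassette, and record modes that control when re-recording is allowed.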

Tier 3 — Live calls. A small, hand-curated set of tests that actually hit the provider, with real API keys, on a cadence that is not every commit. Nightly is common. Pre-merge-to-main is better. These are your canary against model drift: they answer the question "is the model still doing what we expect it to do?" regardless of what our frozen mocks and recorded cassettes say. You want this tier to be tiny and expensive rather than large and averaged-out, because its job is to catch behavioral regression in the model itself.
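One way to gate this tier, sketched with stdlib unittest — pytest users typically do the same with a custom marker and -m live; the environment-variable name here is an assumption:

```python
import os
import unittest

# Tier 3 hides behind an explicit opt-in: the per-commit CI job skips it,
# and the nightly job sets RUN_LIVE_LLM_TESTS=1 alongside real API keys.
RUN_LIVE = os.environ.get("RUN_LIVE_LLM_TESTS") == "1"

@unittest.skipUnless(RUN_LIVE, "live model tests run nightly, not per-commit")
class LiveModelCanary(unittest.TestCase):
    def test_model_still_emits_valid_json(self):
        # Hypothetical: a real call through your adapter would go here.
        self.fail("replace with a real call through your adapter")

# Demonstrate the gate: without the env var, the suite reports a skip,
# not a failure.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LiveModelCanary)
result = unittest.TestResult()
suite.run(result)
assert len(result.skipped) == 1 and not result.failures
```

The important property is that skipping is loud and deliberate: a skipped canary shows up in CI output, whereas a deleted one disappears silently.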

The mistake is collapsing these tiers. Running everything with real calls is expensive, slow, and non-deterministic. Running everything with stub fakes is cheap and lies to you. The architecture works because each tier tests a different thing.

Seam design: making prompts and tools injectable without polluting product code

The tier architecture presupposes you can actually substitute the model at test time, which is a design decision your production code has to enable. This is where a lot of LLM codebases fall into a bad equilibrium: the product code grows if TESTING: branches, or the "model client" becomes a god-object with a use_mock flag, or every call site directly imports the SDK and there is nowhere to intercept.

The principle borrowed from hexagonal architecture applies directly: the domain code should not know that a specific provider exists. It should know that some object satisfies a contract — generate(prompt, tools) -> Response — and that object is passed in. This is a plain old port/adapter split. The domain-level prompt orchestration never imports openai or anthropic. An adapter layer translates between the port's contract and the provider's SDK. Tests pass a different adapter; production passes the real one.
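A minimal sketch of that split, assuming the generate(prompt, tools) contract described above — every class and function name here is illustrative:

```python
from typing import Protocol

# The port: domain code depends only on this contract, never on an SDK.
class ModelPort(Protocol):
    def generate(self, prompt: str, tools: list) -> str: ...

# Domain-level orchestration -- no `import openai`, no `import anthropic`.
def answer_question(model: ModelPort, question: str) -> str:
    return model.generate(f"Answer concisely: {question}", tools=[])

# Production adapter (sketch): translates the port to an OpenAI-style SDK.
class OpenAIAdapter:
    def __init__(self, client):
        self.client = client

    def generate(self, prompt, tools):
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

# Test adapter: same contract, no network, no SDK.
class StubAdapter:
    def generate(self, prompt, tools):
        return "stub answer"

assert answer_question(StubAdapter(), "what is DI?") == "stub answer"
```

Swapping providers, or swapping in a cassette-backed adapter for Tier 2 tests, is now a one-line change at the composition root rather than a sweep through every call site.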

Two refinements make this work specifically for LLMs:

The first is prompt injection in the dependency-injection sense, not the security sense. Prompts are inputs to your system, not hard-coded constants inside it. If a prompt template lives as a string literal inside your orchestration function, your tests cannot vary it and your evals cannot version it. Put prompts behind a loader interface. The loader can be file-based in production, dictionary-based in tests, or database-backed when you start doing prompt A/B tests. The orchestration function never cares.
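A sketch of that loader seam — the class names and the prompt key are hypothetical:

```python
from pathlib import Path

# Production loader: reads versioned template files from a prompt directory.
class FilePromptLoader:
    def __init__(self, prompt_dir):
        self.prompt_dir = Path(prompt_dir)

    def get(self, name: str) -> str:
        return (self.prompt_dir / f"{name}.txt").read_text()

# Test loader: same interface, prompts injected as a plain dict.
class DictPromptLoader:
    def __init__(self, prompts: dict):
        self.prompts = prompts

    def get(self, name: str) -> str:
        return self.prompts[name]

# Hypothetical orchestration: it never knows or cares where prompts live.
def build_messages(loader, user_input):
    return [
        {"role": "system", "content": loader.get("support_system")},
        {"role": "user", "content": user_input},
    ]

loader = DictPromptLoader({"support_system": "You are a support agent."})
messages = build_messages(loader, "Where is my order?")
assert messages[0]["content"] == "You are a support agent."
```

A database-backed loader for prompt A/B tests slots in behind the same get() method.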

The second is tool injection. Agent tools should be passed in as a list of objects satisfying a tool-interface contract, not imported from a registry. In tests you pass stub tools that record what the agent tried to call. In production you pass the real tools. This is the only way you can write a test that asserts "the agent called the search_orders tool with status='pending'" without mocking the entire tool implementation and praying the mock matches reality. It also lets you simulate tool failures cleanly — a stub that raises a specific exception when called with a specific input is trivial to write, whereas instrumenting real tools to fail on command is miserable.
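A stub tool that records calls and fails on demand takes a few lines — the name/run contract and the search_orders example follow the text; the rest is illustrative:

```python
# A stub tool satisfying a hypothetical name/run contract, so the agent
# loop needs no changes between test and production.
class StubTool:
    def __init__(self, name, result=None, raises=None):
        self.name = name
        self.result = result
        self.raises = raises
        self.calls = []  # records what the agent tried to do

    def run(self, **kwargs):
        self.calls.append(kwargs)
        if self.raises is not None:
            raise self.raises
        return self.result

# Hypothetical agent step: look up the requested tool and invoke it.
def execute_tool_call(tools, name, arguments):
    tool = next(t for t in tools if t.name == name)
    return tool.run(**arguments)

search = StubTool("search_orders", result=[{"id": 1, "status": "pending"}])
execute_tool_call([search], "search_orders", {"status": "pending"})

# The assertion from the text: the agent called the right tool with the
# right arguments -- no mocked tool internals required.
assert search.calls == [{"status": "pending"}]

# Simulating a tool failure is one constructor argument:
flaky = StubTool("search_orders", raises=TimeoutError("backend down"))
```

The flaky variant is what makes failure-path tests cheap: the agent's timeout handling gets exercised without touching any real backend.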

The test confidence taxonomy: what each tier can and cannot tell you

The fixture tiers are not interchangeable. They answer different questions with different confidence levels. Being explicit about this is what stops the "tests pass but production breaks" failure mode, because you stop expecting Tier 1 to tell you about Tier 3's concerns.

A rough taxonomy:

  • Tier 1 (stub fakes): high confidence in control-flow correctness (your retry logic, your error handling, your parsing of well-formed outputs). Zero confidence in model behavior. Zero confidence that a real model will produce outputs matching your stub.

  • Tier 2 (recorded cassettes): high confidence in protocol correctness against a pinned model version (headers, schemas, streaming behavior, tool-call formats). Medium confidence in behavior — the recording is accurate but frozen, and the model has likely drifted since the recording was taken. Zero confidence about current live model behavior.

  • Tier 3 (live calls): high confidence in behavior as of the test run. But slow, expensive, flaky, and dependent on having curated inputs whose correct outputs you can actually assert on without brittle exact-matching.

Knowing which tier to reach for depends on what the test is claiming. A test that says "when the user sends garbage, our handler returns a fallback response" is a Tier 1 test — the model is irrelevant. A test that says "when we ask for a JSON object, the parser can read what the model returns" is Tier 2 — you need real-shape output but you do not need it to be live. A test that says "the model still refuses to answer medical questions in French" is Tier 3 — no recording or stub can substitute for a real call against the current model, because that is exactly the behavior you are testing.
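The Tier 2 claim in that list can be made concrete: the parser must handle real-shape output, including quirks like a markdown fence around the JSON — the kind of detail a hand-written stub rarely contains but a recording does. The fence-stripping parser below is a sketch, and the recorded string stands in for a cassette-replayed response body:

```python
import json
import re

# Parse a model's "JSON" output, tolerating an optional ```json fence --
# a real-world quirk that recordings capture and hand-written stubs miss.
def parse_model_json(raw: str) -> dict:
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    payload = match.group(1) if match else raw
    return json.loads(payload)

# Stand-in for a cassette-replayed response body (illustrative, not live).
recorded = '```json\n{"status": "pending", "items": 2}\n```'
assert parse_model_json(recorded) == {"status": "pending", "items": 2}

# Bare JSON from a different recording parses through the same path.
assert parse_model_json('{"status": "shipped"}') == {"status": "shipped"}
```

The test asserts a behavioral property — "the parser can read what the model returns" — not an exact string, so it survives re-recording against a newer model version.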

A safety net for drift: the separate role of evals and traces

The layered fixture suite is necessary but not sufficient. It pins behavior against known-at-the-time-of-writing expectations. It does not tell you when the model itself changes underneath you. That job belongs to two things that live outside the test suite proper.

Evals are golden-dataset regression tests that run on a schedule, not on every commit. They take a set of curated inputs with known-good outputs (or known-good properties) and run them against whatever model version is currently in production. When the score drops, you know the model has drifted even though all your unit tests are green. Keep the dataset small enough to review by hand — 50 to 200 examples is plenty — but make every example a case you truly care about. A golden dataset of a thousand auto-generated examples is a tarpit; a golden dataset of eighty real user prompts someone chose deliberately is a scalpel.
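A scheduled eval harness can be this small, assuming property checks rather than exact strings — fake_model stands in for the production adapter, and every name and example here is illustrative:

```python
# Minimal eval runner: each golden example carries property checks, not
# exact expected strings, so wording drift alone cannot fail the run.
def run_eval(call_model, golden_set):
    passed = 0
    for example in golden_set:
        output = call_model(example["input"])
        if all(check(output) for check in example["checks"]):
            passed += 1
    return passed / len(golden_set)

golden_set = [
    {
        "input": "List two EU capitals as JSON.",
        # Property: output parses as a JSON container, whatever the wording.
        "checks": [lambda out: out.strip()[0] in "[{"],
    },
    {
        "input": "What is the refund window?",
        # Property: the policy-critical number appears somewhere.
        "checks": [lambda out: "30" in out],
    },
]

# Stand-in model for illustration; the nightly job wires the real adapter.
def fake_model(prompt):
    return '["Paris", "Berlin"]' if "JSON" in prompt else "Refunds within 30 days."

score = run_eval(fake_model, golden_set)
assert score == 1.0  # alert when this drops below your threshold
```

The scheduled job's only responsibility is to compare this score against a threshold and page someone when it drops — the drift signal your frozen fixtures cannot produce.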

Production traces close the remaining gap. Your test suite and eval harness cover the behavior you anticipated. Traces cover the behavior you did not. Link each production trace to the exact prompt version, model configuration, and retrieval context that produced it. When you discover a regression six weeks into a deployment, the first question is always "what changed" and you will not be able to answer it without that provenance. This is not a test-fidelity problem; it is a test-fidelity complement.

The rule of thumb

If you take away one discipline from this: mocks should test your code's response to the model, not the model. Anything that encodes "the model returns this exact string" is a test that will break silently the moment the string changes. Anything that encodes "the model returns some string, and our parser does the right thing with it" is a test whose correctness survives provider updates. When you catch yourself hand-writing a mock response, ask whether a real response — recorded once, replayed forever — would tell you the same thing more faithfully. Almost always, the answer is yes, and the effort to flip from hand-written to recorded is smaller than the ongoing cost of maintaining a mock that keeps drifting out of truth. The test suite you want is the one that still means something six months from now, not the one that runs fastest today.
