
Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.

The fidelity gap no one talks about

Mocking an LLM call is trivially easy, and that is exactly what makes it misleading. You have a chat.completions.create call somewhere, you replace it with a MagicMock that returns {"choices": [{"message": {"content": "..."}}]}, and the test passes. The problem is that your mock is a snapshot of a specific model's output at a specific point in time. The actual contract between your code and the model is much richer than that snapshot: tokenization behavior, tool-call argument shapes, refusal patterns, whitespace quirks in structured outputs, the way temperature interacts with your prompt, and the non-obvious cases where a model suddenly decides your system prompt means something different than you intended.
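Here is a minimal sketch of that fragile pattern. The function name, model string, and canned reply are all illustrative; the point is the coupling between the mock's frozen string and the assertion.

```python
from unittest.mock import MagicMock

def summarize(client, text: str) -> str:
    # Hypothetical call site: threads the provider response straight through.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp["choices"][0]["message"]["content"]

# The fragile pattern: the mock freezes one remembered response shape,
# and the assertion couples the test to that exact string.
client = MagicMock()
client.chat.completions.create.return_value = {
    "choices": [{"message": {"content": "The launch went well."}}]
}

assert summarize(client, "launch notes") == "The launch went well."
# This assertion keeps passing no matter what the real model does now.
```

Nothing in this test can fail unless you edit the test itself, which is the no-coupling-to-reality property described below.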

When a team writes assertions against the mocked string, they are not testing their application against the model. They are testing their application against their memory of the model. That memory rots. Providers push silent updates. The mock keeps passing. A report from one team tracking this behavior noted that bugs reached production behind LLM-generated tests whose assertions were tied to the current implementation rather than to the contract: exact return values and mocked internals instead of behavioral properties. The test suite went green after a refactor while the path that actually ran in production was broken.

The tell is simple: if your LLM mock returns a hand-written string that matches no real model output anywhere in your system, you have written a test that can only fail when you change the test. It has no coupling to reality.

The layered fixture architecture

The way out is to treat LLM test fixtures the way database testing treats data fixtures: as a deliberate hierarchy with different cost/fidelity tradeoffs at each tier. The three tiers that pay off in practice:

Tier 1 — Stub fakes. A FakeListLLM-style object that returns canned responses by index. LangChain ships one, and every serious LLM framework has an equivalent. These exist to let you test your application's plumbing: control flow, error handling, retry logic, the state machine around tool calls. They are not testing the model; they are testing everything around the model. Use them liberally. They run in milliseconds. Write a lot of them.
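If your framework does not ship a fake, a FakeListLLM-style stub is a few lines to hand-roll. This sketch is illustrative (the class and method names are not from any particular library); the test at the bottom exercises retry plumbing, which is exactly what Tier 1 is for:

```python
class FakeListLLM:
    """Tier 1 stub: returns canned responses in order and records prompts.

    Hand-rolled, illustrative equivalent of the fakes LLM frameworks ship.
    """
    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = []       # every prompt seen, for assertions
        self._i = 0

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        out = self.responses[self._i % len(self.responses)]
        self._i += 1
        return out

def ask_with_retry(llm, prompt: str, attempts: int = 2) -> str:
    # Code under test: retry until the reply looks like JSON.
    for _ in range(attempts):
        reply = llm.invoke(prompt)
        if reply.startswith("{"):
            return reply
    raise ValueError("no parseable reply")

# First canned response is unparseable, so the retry loop must run twice.
llm = FakeListLLM(["not json", '{"answer": 42}'])
assert ask_with_retry(llm, "q") == '{"answer": 42}'
assert len(llm.calls) == 2
```

Note what is and is not asserted: the test pins the control flow around the model, never the model's prose.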

Tier 2 — Recorded cassettes. Use pytest-recording (VCR.py) or an equivalent to record real HTTP traffic to the provider once, then replay the recording on every subsequent test run. The first run is real and slow; every run after is deterministic and fast. Crucially, the cassette contains the actual shape of the provider's response — headers, streaming deltas, tool-call JSON schemas, the subtle encoding of refusals. When the provider changes that shape, you can re-record and diff the cassette to see what moved. This is how you pin behavior against a specific model version without paying for every test run. Projects like vcr-langchain and baml_vcr exist specifically for this pattern.
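A Tier 2 test with pytest-recording looks roughly like this. The first run with pytest --record-mode=once hits the real API and writes a cassette YAML next to the test; later runs replay it. The real_client fixture and WEATHER_TOOL_SCHEMA are hypothetical stand-ins for your provider SDK wiring; this is a sketch, not runnable as-is without API keys.

```python
import pytest

@pytest.mark.vcr(filter_headers=["authorization"])  # never record API keys
def test_tool_call_shape(real_client):
    resp = real_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is the weather in Oslo?"}],
        tools=[WEATHER_TOOL_SCHEMA],  # hypothetical tool definition
    )
    # Assert the *shape* the provider actually returned, not an exact string.
    call = resp.choices[0].message.tool_calls[0]
    assert call.function.name == "get_weather"
```

When the provider's response shape drifts, delete the cassette, re-record, and diff the YAML; the diff is your changelog of the provider's behavior.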

Tier 3 — Live calls. A small, hand-curated set of tests that actually hit the provider, with real API keys, on a cadence that is not every commit. Nightly is common. Pre-merge-to-main is better. These are your canary against model drift: they answer the question "is the model still doing what we expect?" regardless of what your frozen mocks and recorded cassettes say. You want this tier to be tiny and expensive rather than large and averaged-out, because its job is to catch behavioral regression in the model itself.
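One common way to gate this tier is an environment variable that only the nightly job sets, so the canary never runs on a developer's laptop by accident. The variable name, client fixture, and expected behavior below are all illustrative assumptions:

```python
import json
import os

import pytest

# Skipped unless CI explicitly opts in (e.g. nightly exports RUN_LIVE_LLM_TESTS=1).
live = pytest.mark.skipif(
    not os.environ.get("RUN_LIVE_LLM_TESTS"),
    reason="live model canary; runs nightly, not per-commit",
)

@live
def test_model_still_returns_valid_json(real_client):
    resp = real_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Return a JSON object with key 'ok'."}],
        response_format={"type": "json_object"},
    )
    # Behavioral assertion: the output parses. No exact-string matching.
    json.loads(resp.choices[0].message.content)
```

Keep the assertions behavioral (parses, validates against a schema, names the right tool) so the canary only fires on real drift, not on harmless wording changes.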

The mistake is collapsing these tiers. Running everything with real calls is expensive, slow, and non-deterministic. Running everything with stub fakes is cheap and lies to you. The architecture works because each tier tests a different thing.

Seam design: making prompts and tools injectable without polluting product code

The tier architecture presupposes you can actually substitute the model at test time, which is a design decision your production code has to enable. This is where a lot of LLM codebases fall into a bad equilibrium: the product code grows if TESTING: branches, or the "model client" becomes a god-object with a use_mock flag, or every call site directly imports the SDK and there is nowhere to intercept.
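The clean alternative is a narrow interface and constructor injection: product code depends on the smallest surface it actually uses, and tests pass anything with the same shape. The protocol and class names here are illustrative:

```python
from typing import Protocol

class CompletionClient(Protocol):
    """The seam: the only surface product code knows about.

    Shape it to your narrowest real usage, not to the whole SDK.
    """
    def complete(self, prompt: str) -> str: ...

class TicketTriager:
    # Constructor injection: no TESTING branches, no use_mock flag,
    # no SDK import at the call site.
    def __init__(self, client: CompletionClient) -> None:
        self._client = client

    def triage(self, ticket: str) -> str:
        reply = self._client.complete(f"Label this ticket: {ticket}")
        return reply.strip().lower()

# Production wires in a thin adapter over the real SDK; tests pass any
# object with a matching .complete -- structural typing needs no inheritance.
class StubClient:
    def complete(self, prompt: str) -> str:
        return "  BUG  "

assert TicketTriager(StubClient()).triage("app crashes on login") == "bug"
```

Because typing.Protocol is structural, the real SDK adapter and every tier of fake satisfy the interface without sharing a base class, which keeps the seam out of your inheritance hierarchy entirely.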
