2 posts tagged with "llm-testing"

Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.
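To make the "easy at the code-shape level" part concrete, here is a minimal sketch of that seam, assuming a hypothetical `CompletionClient` protocol and `PromptRunner` (names invented for illustration, not taken from any post or SDK):

```python
from typing import Protocol


class CompletionClient(Protocol):
    """The seam: anything that can turn a prompt into text."""
    def complete(self, prompt: str) -> str: ...


class PromptRunner:
    """Takes the client as a dependency instead of importing a provider SDK."""
    def __init__(self, client: CompletionClient):
        self._client = client

    def summarize(self, document: str) -> str:
        return self._client.complete(f"Summarize:\n{document}")


class FakeClient:
    """Test double: returns a canned response and records every prompt."""
    def __init__(self, canned: str):
        self.canned = canned
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned


# In a test, the fake slots in where the real client would go:
fake = FakeClient(canned="A short summary.")
runner = PromptRunner(client=fake)
assert runner.summarize("long doc") == "A short summary."
assert "long doc" in fake.prompts[0]
```

This is exactly the shape that passes review and still loses fidelity: the fake's canned string says nothing about what the real model returns today.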

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.
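One small piece of that fidelity story can be sketched here: asserting on the output's contract rather than on a frozen string. The `validate_extraction` helper below is hypothetical, but it shows the difference between a mock that re-validates last quarter's exact bytes and a check that fails the moment the shape drifts:

```python
import json


def validate_extraction(raw: str) -> dict:
    """Contract check: assert on structure, not on an exact string."""
    data = json.loads(raw)  # fails loudly on malformed JSON
    assert isinstance(data.get("items"), list), "expected an 'items' list"
    for item in data["items"]:
        assert "name" in item and "address" in item, "missing required field"
    return data


# A frozen-string equality assertion would pass against any recorded
# fixture, drifted or not. A contract assertion catches the drift:
recorded = '{"items": [{"name": "Acme", "address": {"city": "Oslo"}}]}'
validate_extraction(recorded)  # shape is right, so this passes
```

The same check runs unchanged against a cheap fake, a recorded fixture, or a live model call, which is what lets the layers compose.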

Capability Probing: How to Map Your Model's Limitations Before Users Do

· 10 min read
Tian Pan
Software Engineer

Most teams discover their model's limitations the same way users do — in production, through support tickets. A customer reports the extraction pipeline silently dropping nested addresses. An internal user notices the summarizer hallucinating dates past 8,000 tokens. A compliance review finds the classifier confidently labeling ambiguous cases instead of abstaining.

None of these are surprises. They are capability boundaries that were always there, waiting for the right input to expose them. You either map those boundaries before deployment, or your users map them for you — one incident at a time.

The difference is cost: a probe failure in CI costs you five minutes. A capability gap discovered in production costs you a customer's trust. The discipline of finding those boundaries systematically is capability probing — fault injection for language models. You wouldn't ship a bridge without load-testing the joints. The same logic applies to any model you put in front of users.
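As a rough sketch of what such a probe looks like, assuming a hypothetical `find_boundary` harness and using the nested-address example from above (the real extraction call would be model-backed; a stub stands in here):

```python
def build_nested(depth: int) -> dict:
    """Synthetic input: an address nested `depth` levels deep."""
    node = {"street": "Main St"}
    for _ in range(depth):
        node = {"child": node}
    return node


def find_boundary(extract, max_depth: int = 8):
    """Fault injection: raise difficulty until `extract` first fails,
    and report that depth as the capability boundary."""
    for depth in range(1, max_depth + 1):
        if not extract(build_nested(depth)):
            return depth
    return None  # no boundary found within the probed range


# Stub standing in for a model-backed extractor that silently
# drops anything nested more than 3 levels deep:
def stub_extract(payload: dict, _limit: int = 3) -> bool:
    depth, node = 0, payload
    while "child" in node:
        node = node["child"]
        depth += 1
    return depth <= _limit

assert find_boundary(stub_extract) == 4  # boundary surfaces in CI, not in a ticket
```

A probe like this pins the boundary to a number you can track across model updates, instead of rediscovering it through support tickets.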