2 posts tagged with "llm-testing"

Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.
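To make the "easy at the code-shape level" part concrete, here is a minimal sketch of that seam, assuming a hypothetical `CompletionClient` protocol and `PromptRunner` (names invented for illustration, not taken from any post or SDK):

```python
from typing import Protocol


class CompletionClient(Protocol):
    """The seam: anything that can turn a prompt into text."""
    def complete(self, prompt: str) -> str: ...


class PromptRunner:
    """Takes the client as a dependency instead of importing a provider SDK."""
    def __init__(self, client: CompletionClient):
        self._client = client

    def summarize(self, document: str) -> str:
        return self._client.complete(f"Summarize:\n{document}")


class FakeClient:
    """Test double: returns a canned response and records every prompt."""
    def __init__(self, canned: str):
        self.canned = canned
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned


# In a test, the fake slots in where the real client would go:
fake = FakeClient(canned="A short summary.")
runner = PromptRunner(client=fake)
assert runner.summarize("long doc") == "A short summary."
assert "long doc" in fake.prompts[0]
```

This is exactly the shape that passes review and still loses fidelity: the fake's canned string says nothing about what the real model returns today.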

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.
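One small piece of that fidelity story can be sketched here: asserting on the output's contract rather than on a frozen string. The `validate_extraction` helper below is hypothetical, but it shows the difference between a mock that re-validates last quarter's exact bytes and a check that fails the moment the shape drifts:

```python
import json


def validate_extraction(raw: str) -> dict:
    """Contract check: assert on structure, not on an exact string."""
    data = json.loads(raw)  # fails loudly on malformed JSON
    assert isinstance(data.get("items"), list), "expected an 'items' list"
    for item in data["items"]:
        assert "name" in item and "address" in item, "missing required field"
    return data


# A frozen-string equality assertion would pass against any recorded
# fixture, drifted or not. A contract assertion catches the drift:
recorded = '{"items": [{"name": "Acme", "address": {"city": "Oslo"}}]}'
validate_extraction(recorded)  # shape is right, so this passes
```

The same check runs unchanged against a cheap fake, a recorded fixture, or a live model call, which is what lets the layers compose.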

Capability Probing: How to Map Your Model's Limitations Before Users Do

· 10 min read
Tian Pan
Software Engineer

Most teams discover their model's limitations the same way users do — in production, through support tickets. A customer reports the extraction pipeline silently dropping nested addresses. An internal user notices the summarizer hallucinating dates past 8,000 tokens. A compliance review finds the classifier confidently labeling ambiguous cases instead of abstaining.

None of these are surprises. They are capability boundaries that were always there, waiting for the right input to expose them. You either map those boundaries before deployment, or your users map them for you — one incident at a time.

The difference is cost: a probe failure in CI costs you five minutes. A capability gap discovered in production costs you a customer's trust. The discipline of finding those boundaries systematically is capability probing — fault injection for language models. You wouldn't ship a bridge without load-testing the joints. The same logic applies to any model you put in front of users.
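As a rough sketch of what such a probe looks like, assuming a hypothetical `find_boundary` harness and using the nested-address example from above (the real extraction call would be model-backed; a stub stands in here):

```python
def build_nested(depth: int) -> dict:
    """Synthetic input: an address nested `depth` levels deep."""
    node = {"street": "Main St"}
    for _ in range(depth):
        node = {"child": node}
    return node


def find_boundary(extract, max_depth: int = 8):
    """Fault injection: raise difficulty until `extract` first fails,
    and report that depth as the capability boundary."""
    for depth in range(1, max_depth + 1):
        if not extract(build_nested(depth)):
            return depth
    return None  # no boundary found within the probed range


# Stub standing in for a model-backed extractor that silently
# drops anything nested more than 3 levels deep:
def stub_extract(payload: dict, _limit: int = 3) -> bool:
    depth, node = 0, payload
    while "child" in node:
        node = node["child"]
        depth += 1
    return depth <= _limit

assert find_boundary(stub_extract) == 4  # boundary surfaces in CI, not in a ticket
```

A probe like this pins the boundary to a number you can track across model updates, instead of rediscovering it through support tickets.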