
Snapshot Tests Lie When Your Model Is Stochastic

11 min read
Tian Pan
Software Engineer

The first time a junior engineer on your team types --update-snapshots and pushes to main, your test suite stops being a test suite. It becomes a transcript. The diffs still render in green and red, the CI badge still flips to passing, but the signal has quietly inverted: instead of telling you whether the code is correct, the suite now tells you whether anyone bothered to look at the output. With deterministic code the cost of that inversion stays acceptably low, because most diffs really are intentional. With a stochastic model on the other end of a network call, the same workflow turns every PR into a coin flip, and every reviewer into a rubber stamp.

Snapshot testing was a beautiful idea for a deterministic world. You record what render(<Button />) produced last Tuesday, you assert that this Tuesday it produces the same string, and any diff is, by definition, a behavior change worth a human eyeball. The pattern survived Jest, Vitest, Pytest, the whole React ecosystem, and a generation of UI snapshot extensions, because the underlying contract held: same input plus same code equals same output. The contract does not hold for an LLM call. Same input plus same code plus same prompt produces a different string, and the difference is not a bug — it is the product working as designed.

What teams discover the first time they wire a snapshot test around a model call is that the file does lock down — for about a week. Then the upstream provider ships a quiet 0.1 patch to the model, your prompts start producing slightly different phrasings, and seventeen tests turn red on a Tuesday morning. Half the team regenerates the snapshots without reading the diffs because nothing in the diff is "wrong," it is just different. The other half spends an afternoon scrolling through wall-to-wall paragraph reflows looking for the one regression that actually matters, finds nothing, and joins the regenerate camp. By month three, the snapshot review has become muscle memory: see red, run the update flag, push. The suite is now a write-only data structure.

Why The Maintenance Treadmill Is Not Solvable In Place

The instinct, when this starts happening, is to raise the tolerance. Maybe the snapshot only fails on big diffs. Maybe we strip whitespace before comparing. Maybe we hash a normalized version. Each of these is a step away from "this output is correct" and toward "this output is approximately the same shape as last week's output," which is a different and much weaker claim.

The reason the treadmill cannot be patched is that snapshots assume the value is the spec. Once you accept that the value will drift across model patches, prompt revisions, and even retry attempts within a single request, you are no longer testing the value — you are testing some lossy projection of it, and the lossy projection is something you made up under deadline pressure. The semantics of "the test passed" become "an arbitrary normalization function I wrote at 4pm did not flag a difference," and that sentence does not belong in a release pipeline.

The cleanest way to see this is to ask: what does a green snapshot test guarantee about user-visible behavior? With deterministic code, the answer is "the function produced exactly the bytes I approved." With a model call, the honest answer is "the function produced bytes that, by some metric I baked into the comparator, were close enough to bytes I approved at some point." Engineers do not write deploy gates with sentences that contain "by some metric" and "at some point" if they have a choice.

A Testing Taxonomy That Survives Stochasticity

Replacing snapshots is not one decision; it is four, and each lives at a different layer of the test suite. The pattern I have seen work in practice, across teams shipping LLM features, is to split the suite into bands based on what aspect of the system is being asserted, and to use a different oracle for each band.

Semantic equivalence over field-level matches. When the output is prose or a structured response with prose fields, the assertion you actually want is "this answer means the same thing as the reference," not "this answer is byte-equal to the reference." Embedding-based cosine similarity (with a threshold tuned per field, often 0.85–0.95 depending on how strict the field is) catches genuine drift while ignoring rephrasings. For structured outputs, you check fields that have a stable surface — a price, a category, a yes/no — with strict equality, and fields that are prose with semantic equivalence. Mixed schemas need mixed assertions; treating the whole blob with one comparator is how you end up back on the treadmill.
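A minimal sketch of that mixed comparator, assuming numpy and any embed function that maps text to a vector (a sentence-transformers model or whatever embeddings client you already run); the field names and thresholds here are illustrative, not prescriptive:

```python
import numpy as np

# Which oracle applies to which field is the real spec of the comparator.
STRICT_FIELDS = {"price", "category", "approved"}
SEMANTIC_THRESHOLDS = {"summary": 0.90, "explanation": 0.85}  # tuned per field

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assert_output_matches(candidate: dict, reference: dict, embed) -> None:
    for field in STRICT_FIELDS:
        # Stable-surface fields must match exactly.
        assert candidate[field] == reference[field], f"{field} changed"
    for field, threshold in SEMANTIC_THRESHOLDS.items():
        # Prose fields only need to mean the same thing as the reference.
        similarity = cosine(embed(candidate[field]), embed(reference[field]))
        assert similarity >= threshold, (
            f"{field} drifted: similarity {similarity:.3f} < {threshold}"
        )
```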

Distribution tests over single-sample asserts. A test that calls the model once and asserts on the result is asking a probabilistic system a yes/no question, which means its false-failure rate is fixed by temperature and your luck that morning. Run the test ten or twenty times against a fixed input and assert on the distribution: the share of responses that include a required key fact, the variance of a numeric field, the rate at which the model refuses. This is more expensive in tokens, but it is the only way to make the failure signal correspond to a real regression rather than a tail draw.
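In pytest form this can be as small as the sketch below, assuming generate is a fixture wrapping the live model call; the sample count, prompt, and thresholds are illustrative and should encode how much tail behavior the product actually tolerates:

```python
N_SAMPLES = 20  # more samples, tighter confidence, more tokens

def test_summary_mentions_refund_policy(generate):
    prompt = "Summarize the support ticket: customer asks about the refund policy..."
    outputs = [generate(prompt) for _ in range(N_SAMPLES)]

    # Assert on the distribution, not on any single draw.
    mention_rate = sum("refund" in o.lower() for o in outputs) / N_SAMPLES
    refusal_rate = sum("i can't help" in o.lower() for o in outputs) / N_SAMPLES

    assert mention_rate >= 0.9, f"only {mention_rate:.0%} of samples mention the refund policy"
    assert refusal_rate <= 0.05, f"{refusal_rate:.0%} of samples refused"
```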

Invariant tests over value tests. For most product features, the things that must be true are not values, they are properties. The output must be valid JSON. The summary must mention the user's name if it appears in the input. The classifier must never return a category outside the allowed set. The translation must not contain English words when the target is Japanese. These are property-based tests in the classical sense, and they have the enormous advantage that you do not have to update them when the model changes — the property is the spec, and the model either satisfies it or does not. Recent research on property-based testing for LLM-generated code formalizes this further, treating high-level invariants as the oracle when input-output pairs are unstable, and the same shift applies to LLM-generated content.
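A few invariants written as ordinary pytest tests, assuming classify and summarize are fixtures wrapping live model calls and that the summarizer is contracted to return JSON; none of these assertions need regenerating when the model changes:

```python
import json

ALLOWED_CATEGORIES = {"billing", "shipping", "returns", "other"}

def test_classifier_stays_inside_the_allowed_set(classify):
    result = classify("Where is my package?")
    assert result["category"] in ALLOWED_CATEGORIES

def test_summary_is_valid_json_and_keeps_the_user_name(summarize):
    raw = summarize("Ticket #123: customer named Dana reports a double charge.")
    parsed = json.loads(raw)            # must parse, whatever the phrasing
    assert "Dana" in parsed["summary"]  # the user's name must survive the rewrite
```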

Regression tests against a frozen baseline cohort with a tolerance budget. This is the spiritual successor to the snapshot, and the only place a "this is what we approved" artifact still belongs. You curate fifty to a few hundred representative inputs (the "golden cohort"), you run them on the candidate version, you score each output with whatever scorer the band requires, and you check that the aggregate score is within a tolerance — say, two percent — of the production baseline. The output of any single example is allowed to change. The aggregate is not. This gives you a deterministic gate over a stochastic system, which is the trade you actually need.
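A sketch of that gate, assuming a JSONL cohort file, a recorded baseline aggregate, and a score callable implementing whichever oracle the band requires; the paths and the two-percent budget are illustrative:

```python
import json
import statistics

TOLERANCE = 0.02  # the aggregate may not drop by more than this

def run_cohort_gate(generate, score):
    cohort = [json.loads(line) for line in open("evals/golden_cohort.jsonl")]
    baseline = json.load(open("evals/baseline_scores.json"))  # e.g. {"mean": 0.84}

    # Score every golden example on the candidate version.
    scores = [score(example, generate(example["input"])) for example in cohort]
    mean = statistics.mean(scores)

    # Any single example is allowed to change; the aggregate is not.
    assert mean >= baseline["mean"] - TOLERANCE, (
        f"cohort score {mean:.3f} fell more than {TOLERANCE:.0%} below "
        f"baseline {baseline['mean']:.3f}"
    )
```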

The Fixture Pattern That Keeps The Deterministic Suite Honest

Not every test in an LLM application talks to a model. Most of the code around the model is plain old deterministic software — schema validators, retry handlers, tool dispatchers, response parsers, prompt templating, cache layers — and it deserves the cheap, fast, byte-equal assertions that built modern test suites. The mistake is letting the stochastic concerns leak into the deterministic suite, which happens whenever a unit test pulls in a real model call to set up its fixture.

The fixture pattern that scales is to pin the model output for everything that does not need to be stochastic. Record a small number of canonical responses — one per scenario — into a fixture file, and have the deterministic test path read from that file instead of the model. The recording itself is a separate, manual, batched activity that runs on a real model when prompts or schemas change, and produces fixtures with a recorded model version, recorded prompt, and recorded sampling parameters embedded in the file. The deterministic suite asserts against the recording with the same byte-equal tools you have always used. The stochastic suite, by contrast, talks to the live model with explicit acknowledgment that its assertions are probabilistic.
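One way to shape that, sketched under assumptions: a recording helper run by hand against a client exposing a complete-style method (an assumed interface, not any particular SDK), fixture files that carry their own provenance, and a deterministic test that never touches the network. The parse_status helper stands in for whatever deterministic parser your application defines:

```python
import json
from pathlib import Path

FIXTURE_DIR = Path("tests/fixtures/model_responses")

def record_fixture(name: str, prompt: str, client, model: str, temperature: float):
    """Run by hand, in a batch, when prompts or schemas change."""
    FIXTURE_DIR.mkdir(parents=True, exist_ok=True)
    response = client.complete(model=model, prompt=prompt, temperature=temperature)
    FIXTURE_DIR.joinpath(f"{name}.json").write_text(json.dumps({
        "model": model,              # provenance travels with the recording
        "prompt": prompt,
        "temperature": temperature,
        "output": response,
    }, indent=2))

def load_fixture(name: str) -> str:
    """Used by the deterministic suite instead of a live model call."""
    return json.loads(FIXTURE_DIR.joinpath(f"{name}.json").read_text())["output"]

def parse_status(output: str) -> str:
    """Stand-in for the application's deterministic response parser."""
    return json.loads(output)["status"]

def test_parser_handles_canonical_response():
    # Byte-equal assertions are fine here: the fixture is pinned, so the code
    # around the model is under test, not the model.
    assert parse_status(load_fixture("order_summary")) == "ok"
```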

Two suites, two contracts, two failure modes. The deterministic suite tells you whether the code around the model still works. The stochastic suite tells you whether the model still works for your product. Conflating them — by letting one or two live model calls into the unit tests "because they are fast enough" — is how you end up with a flaky CI no one trusts and a snapshot file no one reads.

"Go Check The Diff" Is Not A Code Review Activity

The deeper organizational point hiding inside all of this is that LLM behavior verification is not a code-review concern. Code review evaluates whether a change to source files is correct, safe, and consistent with the codebase. It assumes the reviewer can read the diff and reason about its consequences. When the diff is "the model produced four paragraphs of slightly different prose," there is no consequence the reviewer can reason about without running an evaluation, and pretending otherwise — by stuffing model outputs into PR comments and asking a human to LGTM them — is theater.

The activity that the diff actually requires is an eval: a structured comparison of candidate behavior against baseline behavior, scored by the appropriate oracle for the task, surfaced in a dashboard or report rather than a unit-test diff. The output of an eval is "score went from 0.82 to 0.79, here are the ten worst regressions, here are the five biggest improvements," and that is the artifact a human can review. The output of a snapshot diff is "ten thousand characters changed, please scroll." One of those produces a decision; the other produces fatigue.

The cultural shift that needs to happen on most teams shipping AI features is to remove the model-output snapshot from the code review surface entirely, and to put the eval results there instead. The PR template stops asking "did you regenerate snapshots?" and starts asking "did the eval pass, and if scores moved, which examples drove it?" The CI runs the eval as a job, posts the delta as a comment, and gates the merge on a tolerance budget rather than a diff approval. This is more infrastructure than --update-snapshots, but it is infrastructure that produces signal instead of noise.
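A sketch of that gate job, assuming the eval step has already written per-example scores for baseline and candidate as JSON maps keyed by example id; the printed report is the artifact that gets posted on the PR, and the exit code is what gates the merge:

```python
import json
import sys

TOLERANCE = 0.02

def eval_report(baseline_path: str, candidate_path: str, top_n: int = 10) -> int:
    baseline = json.load(open(baseline_path))    # {"example_id": score, ...}
    candidate = json.load(open(candidate_path))

    deltas = sorted(
        ((candidate[k] - baseline[k], k) for k in baseline),
        key=lambda pair: pair[0],
    )
    base_mean = sum(baseline.values()) / len(baseline)
    cand_mean = sum(candidate.values()) / len(candidate)

    print(f"aggregate: {base_mean:.3f} -> {cand_mean:.3f}")
    print("worst regressions:")
    for delta, example_id in deltas[:top_n]:
        print(f"  {example_id}: {delta:+.3f}")
    print("biggest improvements:")
    for delta, example_id in deltas[-top_n:]:
        print(f"  {example_id}: {delta:+.3f}")

    # The merge gate is the tolerance budget, not a human approving a diff.
    return 1 if cand_mean < base_mean - TOLERANCE else 0

if __name__ == "__main__":
    sys.exit(eval_report(sys.argv[1], sys.argv[2]))
```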

What To Do On Monday

If you have an LLM feature in production with a snapshot suite around it, the cheapest first move is also the most useful one: count how many of those snapshots have been regenerated in the last ninety days without a meaningful PR comment about why. If the number is non-trivial, the suite is already a rubber stamp, and the question is not whether to replace it but how quickly. Convert the suite to a baseline cohort, score the cohort with whatever oracle is appropriate to the field shape, and gate on a tolerance — anything is better than a string compare.
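A rough way to get that count, assuming Jest-style .snap files; git cannot see PR discussion, so treat the number as an upper bound on silent regenerations and spot-check the commit messages by hand:

```python
import subprocess

def snapshot_commits(days: int = 90, pathspec: str = "*.snap") -> list[str]:
    # Commits in the window that touched any snapshot file.
    out = subprocess.run(
        ["git", "log", f"--since={days} days ago", "--pretty=format:%h %s", "--", pathspec],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

if __name__ == "__main__":
    commits = snapshot_commits()
    print(f"{len(commits)} commits touched snapshots in the last 90 days")
    for line in commits:
        print(" ", line)
```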

If you are starting a new LLM feature, write the eval before you write the second prompt iteration. The temptation is to ship the happy path first and instrument it later, but "later" is when the prompt has accumulated fifty subtle behaviors that no eval has ever pinned down, and any refactor becomes a flying-blind exercise. A small, scrappy eval — even ten examples scored with embedding similarity — is dramatically more useful than a polished snapshot suite, and it composes with everything you build on top of it.

The throughline is the same one that runs through most of AI engineering practice in 2026: the abstractions that worked in deterministic software do not generalize, and the teams that try to bend them to fit pay the cost in flakiness, fatigue, and shipped regressions. Snapshots were a great pattern. They are not the right pattern here. Letting them quietly degrade into a transcript while the test suite turns into a checkbox is the kind of slow failure that does not show up on any dashboard until a customer-facing regression goes out and someone asks why the tests were green.

Stochastic systems demand statistical assertions. The sooner the suite reflects that, the sooner the green badge means something again.
