The Snapshot Trace Test: Production Traces as Your Regression Suite
The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.
Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.
This is the snapshot trace test: pin a representative slice of production traces, replay them on every candidate change, and assert behavioral invariants on the replay output. The cases come from real users, not from a synthetic prompt your eng intern drafted in a Notion doc. The distribution tracks the actual product, not the product as it existed when the suite was frozen. The test suite stays alive because the data flowing into it is alive.
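The shape of the loop is small enough to sketch up front. A minimal sketch in Python, where every name is hypothetical: `replay_request`, `judge_equivalence`, and `diff_call_graph` stand in for the pieces the rest of this piece builds out, not any real library.

```python
from dataclasses import dataclass

@dataclass
class PinnedTrace:
    trace_id: str
    stratum: str              # e.g. "refusal", "deep-tool-chain", "happy-path"
    request: dict             # full request envelope: prompt, tools, model config
    recorded_output: str
    recorded_tool_calls: list
    recorded_latency_s: float

def run_snapshot_suite(pinned: list[PinnedTrace], candidate_model: str) -> list[dict]:
    results = []
    for trace in pinned:
        replay = replay_request(trace.request, model=candidate_model)  # hypothetical
        results.append({
            "trace_id": trace.trace_id,
            "stratum": trace.stratum,
            # Three independent signals, never one pass/fail bit:
            "semantics": judge_equivalence(trace.recorded_output, replay.output),
            "tool_calls": diff_call_graph(trace.recorded_tool_calls, replay.tool_calls),
            "latency_s": replay.latency_s,
            "recorded_latency_s": trace.recorded_latency_s,
        })
    return results
```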
The trick is not the recording. Most teams already have tracing. The trick is the discipline of what to pin, what to assert, and what to ignore — and the realization that none of this works unless your trace recording is itself treated as a versioned, queryable, first-class dataset.
Stop hand-curating, start sampling
The hand-curated eval set has a structural problem: it represents what the engineer who wrote it imagined the system would handle, not what users actually send. Three weeks in, that imagined distribution might be ninety percent accurate. Six months in, after a marketing campaign brings in a new segment, after a feature ships that changes call patterns, after a model upgrade subtly shifts which queries get routed where — the imagined distribution is a fossil.
Production traces don't have this problem. They are, by definition, the distribution. The only question is which slice to keep.
Naive sampling — "grab a thousand random traces from last week" — is worse than no sampling. The bulk of production traffic is the easy stuff: the queries that any model handles cleanly, the workflows that complete in two tool calls, the requests that look identical to each other. Random sampling oversamples the easy cases and undersamples the failure modes you actually need to guard against. Your regression suite ends up testing whether the model can still answer "what's the weather in Paris," which is not interesting.
What you want is stratified sampling weighted toward edge cases. Bucket the trace corpus along axes that matter — user segment, query intent, tool-call depth, latency band, refusal vs. completion, error vs. success — and sample from each bucket independently. Allocate a non-trivial share of the budget to the long tail: the queries that triggered five tool calls, the refusals that the user pushed back on, the traces where the model self-corrected. These are the cases where a model swap or prompt change is most likely to silently regress.
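A sketch of that sampler, assuming traces already carry a stratum tag from the bucketing pass; the strata names and weights here are illustrative, not a recommendation:

```python
import random
from collections import defaultdict

# Illustrative budget split; weight your own failure modes, not these.
STRATUM_WEIGHTS = {
    "happy-path": 0.15,       # deliberately undersampled
    "deep-tool-chain": 0.30,  # five-plus tool calls
    "refusal": 0.20,
    "self-corrected": 0.20,
    "error": 0.15,
}

def stratified_sample(traces: list[dict], budget: int, seed: int = 0) -> list[dict]:
    """Sample each stratum independently instead of uniformly at random."""
    rng = random.Random(seed)
    by_stratum: dict[str, list[dict]] = defaultdict(list)
    for t in traces:
        by_stratum[t["stratum"]].append(t)
    sample: list[dict] = []
    for stratum, weight in STRATUM_WEIGHTS.items():
        pool = by_stratum.get(stratum, [])
        k = min(len(pool), round(budget * weight))
        sample.extend(rng.sample(pool, k))
    return sample
```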
Refresh the pinned set on a cadence. A weekly cron that retires the oldest ten percent of traces and replaces them with fresh samples from the same strata keeps the suite anchored to the current distribution. This sounds like overhead until you compare it against the alternative: a frozen suite that drifts a few percent further from reality every week and has to be wholesale rewritten in a panic when someone notices.
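The refresh is a few lines on top of the same sampler, assuming each pinned trace records when it was pinned:

```python
def refresh_pinned_set(pinned: list[dict], fresh_traces: list[dict],
                       retire_fraction: float = 0.10, seed: int = 0) -> list[dict]:
    """Retire the oldest slice, backfill from fresh production traces drawn
    from the same strata so the pinned distribution stays current."""
    pinned = sorted(pinned, key=lambda t: t["pinned_at"])
    n_retire = int(len(pinned) * retire_fraction)
    replacements = stratified_sample(fresh_traces, budget=n_retire, seed=seed)
    return pinned[n_retire:] + replacements
```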
The exact-equality trap
Here is the failure mode that wastes an entire quarter of engineering time: a team stands up trace replay, asserts that the new model's output equals the recorded output character-for-character, watches half the suite turn red on the first run, and spends three months chasing benign variance.
LLM outputs are not deterministic in the way function outputs are. Even with temperature zero, the same prompt can yield different tokens across model versions, across providers, even across the same provider on different days as inference kernels get re-tuned. Phrasing shifts. Bullet ordering shifts. The model says "you can do X" instead of "to do X, you can." None of this matters for the user. All of it breaks exact-equality assertions.
The team that ships exact-equality replay tests learns this lesson the slow way. They mark cases as "expected to differ" until the exception list is longer than the test list. They invent regex normalizers that strip whitespace and lowercase everything, and the regexes catch the easy cases but not the genuine semantic shifts. They eventually conclude that "trace replay doesn't work for LLMs" and go back to hand-curated golden outputs, which is the wrong conclusion.
The right conclusion is that equality is the wrong primitive. Stochastic systems need stochastic assertions: not "did the output match" but "does the output, on this specific axis, fall within the band the rubric defines." The assertion library you actually need is different from the one Jest gives you, and pretending otherwise is how the suite ends up in the trash.
The assertion library you actually need
A snapshot trace test that works in production has three classes of assertion, each tuned to a different layer of the system, and none of them are exact equality.
Semantic equivalence on outputs. The user-facing answer should mean the same thing as the recorded answer, even if the words differ. This is what LLM-as-judge is for: a separate model — ideally a stronger one, with a structured rubric — scores whether the new output preserves the salient claims, the safety properties, the factual content, and the formatting commitments of the recorded one. Pin the judge model, version the rubric, and treat both as part of the test suite, not infrastructure to be swapped silently. A judge upgrade that shifts scoring is a regression you'll mistake for a model regression in your system under test.
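A sketch of the judge tier. `complete` is a stand-in for whatever provider call you use; the rubric text and score scale are illustrative, and the point is the two pinned constants at the top:

```python
import json

JUDGE_MODEL = "judge-model-2025-06"  # pinned: swapping it is a suite change
RUBRIC_VERSION = "v3"                # versioned alongside the test code

RUBRIC = """Compare two answers to the same request. Score 1-5 on each axis:
claim preservation, safety properties, factual content, formatting commitments.
Return JSON: {"claims": n, "safety": n, "facts": n, "formatting": n}"""

def judge_equivalence(recorded: str, candidate: str) -> dict:
    prompt = f"{RUBRIC}\n\nRECORDED:\n{recorded}\n\nCANDIDATE:\n{candidate}"
    raw = complete(model=JUDGE_MODEL, prompt=prompt)  # hypothetical provider call
    scores = json.loads(raw)
    scores["rubric_version"] = RUBRIC_VERSION  # stamp every judgment
    return scores
```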
Structural equality on tool-call sequences. The agent should call the same tools in the same order with the same shape of arguments, even if the natural-language reasoning differs. This is where exact-equality does work, because tool calls are structured: the function name, the argument schema, and the order are discrete and deterministic in a way that prose isn't. A trace where the model previously called search_db → fetch_doc → summarize and now calls summarize → search_db → fetch_doc is a behavior change that prose-level semantic checks will miss. Argument values often need fuzzier comparison (a query string that paraphrases the same intent should pass), but the call graph itself should be pinned tightly.
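A sketch of the structural diff, assuming tool calls are recorded as `{"name": ..., "args": {...}}` dicts; `same_intent` is a hypothetical fuzzy comparator (an embedding-similarity check or a small judge call):

```python
def diff_call_graph(recorded: list[dict], replayed: list[dict]) -> dict:
    """Exact match on function names and order; fuzzy match on string args."""
    rec_names = [c["name"] for c in recorded]
    rep_names = [c["name"] for c in replayed]
    if rec_names != rep_names:
        return {"pass": False,
                "reason": f"call graph changed: {rec_names} -> {rep_names}"}
    mismatches = []
    for rec, rep in zip(recorded, replayed):
        for key, rec_val in rec["args"].items():
            rep_val = rep["args"].get(key)
            if isinstance(rec_val, str):
                # Paraphrased query strings with the same intent should pass.
                if not same_intent(rec_val, rep_val):  # hypothetical comparator
                    mismatches.append((rec["name"], key))
            elif rec_val != rep_val:  # structured values compare exactly
                mismatches.append((rec["name"], key))
    return {"pass": not mismatches, "arg_mismatches": mismatches}
```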
Latency bands rather than point estimates. Every trace records wall-clock latency for each span. Asserting "latency must equal 1.42 seconds" is absurd; asserting "p95 latency on this trace bucket must stay under two seconds" is the contract you actually have with users. Replay the pinned set, compute the latency distribution, compare against the recorded distribution as bands rather than points, and flag shifts that exceed a threshold. A model swap that doubles tail latency on tool-heavy traces is the kind of regression that exact-equality replay would never catch and that latency-band assertions catch on the first run.
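A sketch of the latency tier using the standard library; the two-second ceiling and 25 percent drift tolerance are illustrative numbers, and each bucket needs enough traces for the percentile to be stable:

```python
import statistics

def check_latency_band(recorded_s: list[float], replayed_s: list[float],
                       ceiling_s: float = 2.0, drift: float = 1.25) -> dict:
    """Compare distributions, not points: p95 must stay under the product
    ceiling and within a tolerance band of the recorded p95."""
    rec_p95 = statistics.quantiles(recorded_s, n=100)[94]
    rep_p95 = statistics.quantiles(replayed_s, n=100)[94]
    return {
        "recorded_p95": rec_p95,
        "replayed_p95": rep_p95,
        "under_ceiling": rep_p95 <= ceiling_s,
        "within_band": rep_p95 <= rec_p95 * drift,
    }
```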
The mental model: outputs get judged, tool calls get diffed, latency gets bounded. Each layer asserts against the property it actually has, not against equality.
Recording infrastructure is the prerequisite
Here is the org-level realization that takes longer to internalize than it should: the eval suite is downstream of the trace-recording infrastructure, not the other way around.
If trace recording is sampled, lossy, or missing fields you need for replay — request payload, system prompt version, tool definitions, model configuration, RNG seed — then the eval suite cannot be built, no matter how skilled the eval team. You can't replay a trace whose system prompt wasn't captured. You can't diff a tool-call sequence whose argument schemas were truncated by a JSON serializer. You can't bound latency against a trace whose timing data was dropped because the tracing backend rate-limited the span.
The teams that run snapshot trace tests well do three things at the trace layer first. They capture the full request envelope — model name and version, system prompt hash, tool definitions, decoding parameters, the entire user input — so a trace from six months ago can be replayed against a candidate today and the comparison is apples-to-apples. They version the trace schema so a trace recorded under schema v3 doesn't silently lose fields when the recording layer ships v4. And they expose traces as a queryable dataset, not as a stream of log lines: SQL, filters by metadata, joins against user-segment tables, the full data warehouse posture. A trace store that can only be searched by request ID is a liability for any eval discipline more serious than spot-checking individual failures.
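What that envelope looks like as a record shape; a sketch, not any particular backend's schema:

```python
from dataclasses import dataclass

TRACE_SCHEMA_VERSION = 4  # bump on every field change; never drop fields silently

@dataclass
class TraceRecord:
    schema_version: int
    trace_id: str
    # Full request envelope: everything needed to replay apples-to-apples.
    model_name: str
    model_version: str
    system_prompt_hash: str
    system_prompt: str
    tool_definitions: list[dict]
    decoding_params: dict            # temperature, top_p, max_tokens, seed
    user_input: str
    # Recorded behavior: what the assertion tiers compare against.
    output: str
    tool_calls: list[dict]
    span_latencies_s: dict[str, float]
    # Queryability: metadata for stratification and warehouse joins.
    user_segment: str
    intent: str
    timestamp: str
```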
This reframes a lot of the budget conversation. When a team asks "should we hire someone to own evals," the more accurate question is often "should we invest in the tracing backend that owns the upstream data the eval suite needs." The eval suite is the visible artifact. The trace backend is the soil it grows in. A team that builds the eval suite on top of a tracing system that records eighty percent of fields is going to write the same evals four times as the tracing layer slowly fills in the gaps.
What to do on Monday
Three concrete moves to get from "we have tracing" to "we have a snapshot trace regression suite that catches real regressions."
First, audit your trace recording for replay completeness. Pick three production traces and ask whether you could, today, reconstruct the exact request — including system prompt, tool list, model version, decoding params — and run it again. If the answer is no for any of those fields, that's the first thing to fix. The eval suite waits.
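The audit itself is mechanical once you name the fields; a sketch, with the required-field list as the part you adapt:

```python
REPLAY_REQUIRED_FIELDS = [
    "system_prompt", "tool_definitions", "model_name",
    "model_version", "decoding_params", "user_input",
]

def audit_replay_completeness(traces: list[dict]) -> dict[str, list[str]]:
    """Return, per trace, the fields you could not replay from today."""
    gaps = {}
    for t in traces:
        missing = [f for f in REPLAY_REQUIRED_FIELDS if not t.get(f)]
        if missing:
            gaps[t["trace_id"]] = missing
    return gaps
```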
Second, pick fifty traces, stratified across your top use cases and including a deliberate skew toward edge cases, and pin them as the v0 regression set. Resist the temptation to start with five hundred — a small, well-chosen, refreshed set beats a sprawling, frozen one. Tag them by intent and segment so the eval report breaks down failures by stratum, not by a single aggregate pass rate.
Third, build the assertion layer in three tiers, not one. LLM-as-judge for output semantics. Structural diff for tool-call sequences. Distribution bands for latency. Wire each tier as a separate signal in the eval report, so a change that improves output quality but doubles latency surfaces as exactly that — improvement and regression — rather than as a single mushy "score went down" verdict.
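A sketch of that report shape, aggregating the per-trace results from the replay loop sketched earlier; the stratum keys and column names are illustrative:

```python
from collections import defaultdict

def build_eval_report(results: list[dict]) -> dict:
    """One row per stratum, one column per tier: a change that improves
    semantics but doubles latency shows up as exactly that."""
    by_stratum = defaultdict(list)
    for r in results:
        by_stratum[r["stratum"]].append(r)
    report = {}
    for stratum, rows in by_stratum.items():
        report[stratum] = {
            "n": len(rows),
            "worst_judge_axis": min(
                min(r["semantics"][k] for k in ("claims", "safety", "facts", "formatting"))
                for r in rows),
            "tool_call_pass_rate": sum(r["tool_calls"]["pass"] for r in rows) / len(rows),
            "latency": check_latency_band(
                [r["recorded_latency_s"] for r in rows],
                [r["latency_s"] for r in rows]),
        }
    return report
```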
The teams that do this well stop having the "are our evals real" conversation. The traces are real because users sent them. The judgments are honest because the rubric is versioned. The latency check is grounded because the recording captured it. The suite stays alive because the data flowing into it stays alive. That is the entire trick — and the reason the teams that internalize it ship model upgrades on a weekly cadence while the teams that don't are still trying to figure out whether last Thursday's deploy regressed anything.
