
Your Shadow Eval Set Is a Compliance Time-Bomb

10 min read
Tian Pan
Software Engineer

The most dangerous data store in your AI stack is the one nobody designed. It started with a Slack message during a sprint: "Real users are the only thing that catches real bugs — let's tap a percentage of production traffic into the eval pipeline so we can replay it nightly." Six engineers thumbs-upped the message. Nine months later, the bucket holds 4.3 million traces, an eval job pages the on-call when failure rates rise, and the failure cases are emailed verbatim to a Slack channel where forty people can read them. The traces include email addresses, internal company names, partial credit-card digits, employee phone numbers, and customer support transcripts where users explained why they were upset.

Nobody mapped the data flow. No DPIA covered it. The privacy review last quarter looked at the model vendor's API; it didn't look at your eval job. And then a data-subject deletion request arrives, and the team discovers that "delete this user's data everywhere" is a sentence that no longer maps to anything they can actually do.

This is the shape of the most common AI compliance incident I see in 2026. Not a model leak. Not a vendor breach. Not a prompt injection. A bucket of production transcripts, named eval-prod-traffic-v3, accumulating sensitive data that users were never told would be processed this way.

The tap that nobody thought of as data processing

Every team that ships an LLM-backed product eventually rediscovers the same insight: synthetic evals miss the failure modes that matter, because the failure modes that matter come from queries you didn't anticipate. A user asks the agent to compare three subscription tiers using a screenshot of their own invoice. Another user pastes three paragraphs of an angry customer email and asks the agent to draft a response. Neither of these queries lives in your golden eval set.

So someone wires a tap. Production traces — full prompts, tool calls, tool responses, model outputs — get duplicated into a parallel pipeline. The pipeline scores each trace on a few metrics, surfaces regressions, and stores the worst failures for human review. The engineering team thinks of this as evaluation infrastructure. The legal team, if they ever heard about it, would think of it as an entirely new processing activity that materially expands the scope of what the company does with user data.

Those two framings never collide because the pipeline doesn't sit on the legal team's diagram. It got added to a Terraform module called observability-internal and inherited the IAM policy of the team that owns the metrics dashboard. The retention is whatever the underlying object store defaults to, which is forever.

The data-subject request that breaks the abstraction

The first time a deletion request lands, the team treats it as an exercise. The user-facing application has a clean delete path: drop the row, drop the chat history, invalidate the session. Done in twenty minutes. Then someone remembers the eval bucket.

This is where lossy anonymization stops looking clever and starts looking expensive. A common pattern is to run a redaction step before persisting traces — scrub email patterns, mask digits, replace names with placeholders. The redaction is irreversible, which feels safe. But irreversible redaction also means there is no surviving join key from a trace back to the user it came from. You can search the bucket for the user's email, find nothing, and conclude that the deletion is satisfied — except the trace still contains the user's question about a medical condition, the company they work at, and a screenshot of an invoice that the redactor missed because PII detection on freeform text is genuinely hard.

The team that did the redaction made a defensible engineering choice. The team that has to honor the deletion request now has a defensible legal problem: they cannot prove the data is gone, because they cannot prove what was in it.

The retention promise that was never written down

Most evaluation pipelines accumulate data on a "default to keep" basis. The engineers who built the pipeline assume the data is bounded by storage cost — old traces will eventually fall out when someone runs out of budget. The privacy notice the company shows users says something like "we retain your interaction history for 30 days for service quality purposes." Those two sentences disagree, and the eval bucket is where the disagreement compounds.

The cleaner pattern is one most observability vendors already support: tag-based retention enforced at the storage layer, not at the application layer. Datadog, Splunk, and Lightstep each let you keep a small slice of high-value traces (say, error traces or sampled successes) for 30 days and discard everything else within hours. The same primitive applies to your eval bucket: the engineering question is not "how long should we keep these traces" — it is "which traces do we keep, and what is the rule that drops the rest by default."
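
A minimal sketch of what that rule can look like at the storage layer, assuming traces land in an S3 bucket and carry a retention tag at write time; the bucket name, tag values, and TTLs are illustrative, and sub-day expiry would need an application-side sweep because S3 lifecycle granularity is days:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="eval-prod-traffic-v3",  # hypothetical bucket from the anecdote above
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "keep-error-class-traces-30-days",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "retention", "Value": "error"}},
                "Expiration": {"Days": 30},
            },
            {
                "ID": "drop-success-class-traces-by-default",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "retention", "Value": "success"}},
                # Lifecycle rules expire in whole days; 1 day is the floor here.
                "Expiration": {"Days": 1},
            },
        ]
    },
)
```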

Without that rule, the default is forever, and forever is a promise no privacy notice can support.

Sanitize at the trace layer, not at eval time

The architectural fix is to push sanitization upstream of the eval bucket. Two patterns work; a third one looks like it works and doesn't.

Pattern that works: deterministic tokenization at write time. When a trace is recorded, sensitive entities are replaced by stable tokens generated by a separate service that holds the mapping. The eval bucket sees cust_a3f8 everywhere instead of acme.com; the bucket has no way to recover the original. If a deletion request arrives, the tokenization service drops the mapping, and the eval data becomes unlinkable to the user — which, under most regulators' reading, satisfies the request without rewriting the eval bucket. This is the closest thing to an honest answer to "we evaluate against production patterns, not production data."
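
Here is a minimal sketch of the write-side piece, assuming a separate service that holds both the secret and the token-to-value mapping. The TokenVault name and its methods are hypothetical, and entity detection is a separate problem left out of the sketch:

```python
import hashlib
import hmac


class TokenVault:
    """Hypothetical tokenization service; the mapping never leaves this service."""

    def __init__(self, secret: bytes):
        self._secret = secret
        # Mappings are keyed by user so a deletion request can drop them wholesale.
        self._mappings: dict[str, dict[str, str]] = {}

    def tokenize(self, user_id: str, entity_type: str, value: str) -> str:
        # Deterministic: the same value for the same user always yields the same
        # token, so eval metrics that group by entity still work on tokenized traces.
        digest = hmac.new(self._secret, f"{user_id}:{value}".encode(), hashlib.sha256)
        token = f"{entity_type}_{digest.hexdigest()[:8]}"
        self._mappings.setdefault(user_id, {})[token] = value
        return token

    def forget_user(self, user_id: str) -> None:
        # Deletion request: drop the mapping. Tokens already sitting in the eval
        # bucket become unlinkable; the bucket itself is never rewritten.
        self._mappings.pop(user_id, None)


# Write path: the trace recorder calls the vault before anything is persisted.
vault = TokenVault(secret=b"rotate-me")
trace_field = vault.tokenize("user_9112", "cust", "acme.com")  # a "cust_a3f8"-style token
```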

Pattern that works: schema-typed spans with classifications enforced at write. Each span attribute carries a sensitivity class — public, internal, pii, regulated — declared in the instrumentation library. The trace pipeline refuses to persist regulated-class attributes outside of the regulated bucket and rejects unclassified attributes at the gateway. This shifts the privacy decision from "did the redaction regex catch it" to "did the engineer who added this span field declare what's in it." The latter is auditable; the former is not.
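
A minimal sketch of that write-time enforcement, with an illustrative attribute registry; a real pipeline would hang this off its trace collector or gateway rather than a standalone function:

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"
    REGULATED = "regulated"


# Declared in the instrumentation library, next to the code that emits the span.
ATTRIBUTE_CLASSES: dict[str, Sensitivity] = {
    "tool.name": Sensitivity.PUBLIC,
    "model.output.tokens": Sensitivity.INTERNAL,
    "user.prompt": Sensitivity.PII,
    "invoice.image_ref": Sensitivity.REGULATED,
}


def route_attribute(key: str) -> str:
    """Decide where an attribute may be persisted, or reject it at the gateway."""
    cls = ATTRIBUTE_CLASSES.get(key)
    if cls is None:
        # Unclassified fields never make it past the gateway.
        raise ValueError(f"attribute {key!r} has no declared sensitivity class")
    if cls is Sensitivity.REGULATED:
        return "regulated-bucket"  # never lands in the general eval bucket
    if cls is Sensitivity.PII:
        return "tokenize-then-eval-bucket"  # routed through the tokenization service first
    return "eval-bucket"
```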

Pattern that fails: regex sanitization at eval-time. The team writes a script that masks email addresses and phone numbers when traces are read out for review. The script runs on read because that's where it's most visible. The script does not run on write, so the raw data sits in the bucket. A six-week DPIA review later determines that the data was processed (collected, stored, transformed) the entire time it sat unredacted in the bucket — the read-side scrubber doesn't change that. This is the shape of the finding that turns a data-handling review into an enforcement action.

The hard conversation is the one between a privacy lawyer who reads the DPIA and an engineer who maintains the eval pipeline. The DPIA says, "We do not use customer interaction transcripts for evaluation purposes." The bucket says otherwise. Both statements were made in good faith. The lawyer never knew the bucket existed. The engineer never knew the DPIA had committed the company to a posture the bucket violates.

This is why DPIAs for AI systems need to enumerate processing activities at the pipeline level, not the product level. The product is "AI assistant for support agents." The processing activities include: inference against the user prompt, persistence of the trace for observability, replay of the trace for nightly evaluation, persistence of failure cases for human review, and notification of failure cases to a triage channel. Each of those is a distinct processing activity with a distinct purpose, recipient, and retention. Bundling them all under "AI assistant" makes the eval pipeline invisible until something goes wrong.
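
One way to make that enumeration concrete is to keep it as a record the DPIA can point at. A sketch, mirroring the list above, with the recipients and retention values as illustrative placeholders:

```python
# Pipeline-level processing activities for the "AI assistant for support agents" product.
PROCESSING_ACTIVITIES = [
    {"activity": "inference against the user prompt",
     "purpose": "answer the user's request",
     "recipients": ["model vendor API"], "retention": "not persisted"},
    {"activity": "trace persistence for observability",
     "purpose": "debugging and service quality",
     "recipients": ["platform engineering"], "retention": "30 days"},
    {"activity": "nightly trace replay for evaluation",
     "purpose": "regression detection",
     "recipients": ["ML engineering"], "retention": "per tag class"},
    {"activity": "failure-case persistence for human review",
     "purpose": "error analysis",
     "recipients": ["ML engineering"], "retention": "30 days"},
    {"activity": "failure-case notification to triage channel",
     "purpose": "alerting",
     "recipients": ["triage Slack channel"], "retention": "per workspace policy"},
]
```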

The same discipline applies to the change-control process. GDPR's accountability principle treats material changes to processing as triggers for DPIA review. "We added shadow eval to catch regressions" is a material change. It rarely shows up as one because it does not change the user-visible product, and the people who would notice are not on the deployment review.

The synthetic-data line that's worth defending

There is a defensible position the industry is converging on: evaluate against synthetic data derived from production patterns rather than against production transcripts with names lightly changed. The distinction matters because the legal posture is fundamentally different.

Synthetic-from-patterns means the production traces are used (with consent and DPIA cover) to fit a generator — a small model, a template population, a statistical sampler — and the eval set is drawn from the generator. The eval set contains no individual record. A data-subject request (DSR) doesn't apply because no data subject's data is in the eval set. The price is that the synthetic data has to be evaluated for fidelity, utility, and privacy risk in its own right; this is a real tax. Frameworks like SynEval and SynthTextEval exist precisely because that evaluation is non-trivial.
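
A minimal sketch of the template-population flavor, with hypothetical intents, weights, and fillers standing in for whatever the consented, DPIA-covered production analysis produced:

```python
import random

# Hypothetical intent frequencies estimated from production traffic;
# nothing from any individual trace survives into the eval set.
INTENT_WEIGHTS = {"compare_plans": 0.42, "draft_reply": 0.33, "billing_dispute": 0.25}

TEMPLATES = {
    "compare_plans": "Compare the {tier_a} and {tier_b} tiers for a team of {seats}.",
    "draft_reply": "Draft a reply to a customer upset about {issue}.",
    "billing_dispute": "I was charged {amount} twice on my last invoice. What happened?",
}

FILLERS = {
    "tier_a": ["Starter", "Pro"],
    "tier_b": ["Business", "Enterprise"],
    "seats": ["5", "40", "250"],
    "issue": ["a delayed refund", "a missing feature"],
    "amount": ["$49", "$1,200"],
}


def synth_eval_case(rng: random.Random) -> dict:
    intent = rng.choices(list(INTENT_WEIGHTS), weights=list(INTENT_WEIGHTS.values()))[0]
    slots = {k: rng.choice(v) for k, v in FILLERS.items() if "{" + k + "}" in TEMPLATES[intent]}
    return {"intent": intent, "prompt": TEMPLATES[intent].format(**slots)}


rng = random.Random(7)
eval_set = [synth_eval_case(rng) for _ in range(500)]  # contains no individual record
```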

Lightly-changed transcripts are the opposite. The eval set still contains the original prompts and the original tool responses, with names swapped to placeholders or initials. Re-identification attacks against this kind of data are routine in the membership-inference literature. Calling the result "anonymized" is closer to a vibe than a guarantee. Regulators have started treating the distinction seriously; an enforcement decision against "anonymized" data that turns out to be re-identifiable is much more painful than one against data the company admitted was personal.

What to do this week if your eval bucket exists

Three actions are worth more than the rest combined.

Map the flow. A diagram with arrows from "user request" to "model" to "trace store" to "eval pipeline" to "Slack failure channel" to "engineer's laptop during review" is a one-day exercise that almost no team has done. The diagram is the thing the privacy team needs to assess; it is also the thing that surfaces the email-to-Slack arrow that nobody had thought about.

Stop the bleeding at the write boundary. Whatever sanitization regime you adopt, run it on the write path before the trace lands in the bucket. Anything else is a paper shield. Schema-typed spans are the highest-leverage change because they make the privacy decision explicit at the point of instrumentation, where the engineer actually knows what the field contains.

Set a retention floor by tag class, not by total volume. "Keep 30 days" is not a retention policy; it is a storage budget. "Keep error-class traces for 30 days, success-class traces for 24 hours, regulated-class traces never persist outside the regulated bucket" is a retention policy. The difference shows up in audit; it also shows up in your DSR runbook.
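
That policy is small enough to live as a readable artifact that both the audit and the DSR runbook can point at. A sketch, reusing the tag classes named above; bucket names are illustrative:

```python
from datetime import timedelta

RETENTION_POLICY = {
    "error":   {"bucket": "eval-bucket", "ttl": timedelta(days=30)},
    "success": {"bucket": "eval-bucket", "ttl": timedelta(hours=24)},
    # Regulated-class traces never persist outside the regulated bucket, so there
    # is deliberately no rule routing them to the eval bucket at all.
    "regulated": {"bucket": "regulated-bucket", "ttl": None},  # governed by that bucket's own policy
}
```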

The shadow eval set is the most useful piece of infrastructure most AI teams have built in the last two years. It does catch the bugs that synthetic evals miss. The compliance time-bomb is not the eval set itself — it is the gap between how the engineering team thinks about the eval set and how the rest of the company has to be able to describe it. Closing that gap is the unglamorous, mostly-paperwork, mostly-architecture work that determines whether the next privacy review is a conversation or an investigation.
