Eval Datasets Are Customer Data With a Right Answer Attached
Your golden eval set is a privacy boundary your security team didn't know existed. It is built by sampling production traces, which means it is a curated collection of real customer queries — often containing names, emails, account numbers, transcripts of frustrated calls, half-typed credit card digits — paired with the canonical correct response on top, and then committed to whatever bucket the eval pipeline reads from.
That last part is what makes eval data uniquely dangerous. A raw production trace is sensitive because it captures what the customer said. An eval case is sensitive in a new way because it captures what the customer said plus the labeled correct answer. The label is a derivative work that someone, often an annotator or a domain expert, applied with intent. It signals "this is canonical." It gives the trace a longevity that the original log never had — log retention will eventually rotate the trace out, but the eval case is now a permanent test fixture that the team is committed to keeping green.
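To make the artifact concrete, here is a minimal sketch of what one eval case typically looks like once it leaves the trace store. The field names and contents are illustrative, not any particular framework's schema; the point is what travels together in a single record.

```python
# Illustrative only: hypothetical field names and fictional contents,
# not any specific eval framework's schema.
eval_case = {
    "id": "case-0142",
    "source_trace_id": "trace-9f3a",    # points back at the production trace
    "input": (
        "Hi, this is Dana Whitfield, dana.whitfield@example.com. "
        "I was double-charged on account 4402-1187 last Tuesday and "
        "nobody has called me back."
    ),                                   # verbatim customer query, PII and all
    "expected_output": (
        "Apologize, confirm the duplicate charge, and offer a refund "
        "within 5 business days."
    ),                                   # the canonical answer an expert attached
    "labeled_by": "support-sme",         # the annotation that makes it 'golden'
    "labeled_at": "2025-03-04",
}
```

The `input` field is the customer's data; everything else is the scaffolding that turns it into a permanent fixture.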
Most teams treat eval datasets the way they treat unit test fixtures: shared engineering artifacts, lightly access-controlled, sometimes checked into the repository, sometimes pasted into the appendix of a methodology blog post. None of those handling patterns survive contact with the question "is this customer data?" The answer is yes, and the implications cascade through storage, access, retention, deletion, and contracts in ways that the team usually has not budgeted for.
The Privacy Bypass Hiding in the Eval Pipeline
Production data lives behind real access controls. The customer support tool requires SSO and an audit trail. The data warehouse has row-level security tied to tenant ID. The on-call engineer who pulls a trace to debug a P1 has to justify the read in a ticket. None of that is heroic; it's just the baseline maturity that any company past Series B has been forced to develop.
The eval pipeline routes around all of it. Here is the typical path. A tracing platform — Langfuse, LangSmith, Helicone, an in-house wrapper — captures every prompt and completion. An engineer reviews recent traces, finds the ones where the model misbehaved, copies them into a Google Sheet with a "correct answer" column, hand-labels a few hundred, and exports them as JSON into the team's shared eval-data directory. That directory is on engineering-shared permissions, which means everyone with a laptop can read it. The CI job loads it on every model bump. The same JSON ends up in a contractor's local checkout because they were debugging a flaky regression.
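As a sketch, the export step often amounts to a few lines like the following. The helper shape and file layout are assumptions, not any tracing platform's real API; what matters is that nothing in it knows the data is sensitive.

```python
import json
from pathlib import Path

# Directory the CI eval job reads from; in many setups it lives inside the
# repository checkout, so it inherits plain engineering-share permissions.
EVAL_DIR = Path("eval-data")

def export_failures_to_eval_set(traces: list[dict]) -> Path:
    """traces: records already pulled from the tracing platform (shape is illustrative)."""
    cases = [
        {
            "input": t["prompt"],       # verbatim customer text, no redaction pass
            "expected_output": "",       # hand-labeled later in a spreadsheet
            "source_trace_id": t["id"],
        }
        for t in traces
        if t.get("flagged_bad")          # the traces where the model misbehaved
    ]
    EVAL_DIR.mkdir(exist_ok=True)
    out_path = EVAL_DIR / "golden_set.json"
    # This write is where the classification gets dropped: from here on the data
    # carries repository permissions, not customer-data permissions.
    out_path.write_text(json.dumps(cases, indent=2))
    return out_path
```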
At no point in that chain does anyone re-evaluate whether the data should still be reachable by everyone who can clone the repository. The eval set inherits engineering-team access, not customer-data access. The classification got dropped at the boundary between "production trace" and "test fixture," and once it's labeled as a fixture, the muscle memory takes over: fixtures are checked into the codebase, fixtures are shared with vendors during integration, fixtures are copied into the slide deck when you are explaining how regression testing works at the all-hands.
The architectural failure is structural, not malicious. The team built a system in which the moment a trace becomes a "test case," it stops being treated as customer data. Every other privacy control in the company is downstream of a classification that the eval pipeline silently strips off.
Where Eval Sets Actually Leak
The leak surface is wider than most teams imagine. Here are the recurring patterns.
The contractor laptop. A consultant is brought in to fix a flaky CI suite. They clone the repository, the eval JSON comes with it, and now real customer queries are sitting on a machine that the company has no inventory of, no remote-wipe authority over, and no offboarding process for. When the engagement ends, nobody knows whether the eval set was deleted, because nobody knows it was ever there.
The model card appendix. The team writes up a benchmark methodology for an internal blog post or a public paper. To make the methodology concrete, they include "10 representative examples." Two of those examples have customer names that were supposed to have been redacted but weren't, because nobody applied a PII pass before publication — the eval set had been treated as engineering data, and engineering data doesn't get a redaction review.
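A publication gate does not have to be elaborate to be better than nothing. The sketch below is a deliberately crude regex pass with made-up patterns; it will not catch names, which is exactly why a real gate also needs a proper PII detector and a human reviewer, but even a crude automated pass forces the question "has anyone reviewed this for customer data?" before anything ships.

```python
import re

# Crude, illustrative patterns only. Regexes will not catch names, which is
# exactly the field that leaked in the scenario above; a real gate needs a
# dedicated PII detector plus a human review step.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "card_or_account": re.compile(r"\b(?:\d[ -]?){8,19}\b"),
    "phone": re.compile(r"\b\+?\d{1,3}[ -]?\(?\d{2,4}\)?[ -]?\d{3,4}[ -]?\d{3,4}\b"),
}

def redact_for_publication(text: str) -> tuple[str, bool]:
    """Return (redacted_text, needs_human_review)."""
    flagged = False
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            flagged = True
            text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text, flagged
```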
The fine-tune training corpus. This is the most expensive failure. Eval data and training data are stored in adjacent buckets, with adjacent naming conventions, and a fine-tune script written by a hurried engineer globs the wrong directory. The model is now trained on its own test set. Beyond the methodological catastrophe of contaminated benchmarks (which the 2025 Kernel Divergence Score work on dataset leakage formalizes as a measurable problem), the legal posture is worse: the customer queries that were captured "for the purpose of providing the service" have now been used to train a model — a different category of processing entirely.
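The mechanism is usually one overly broad glob, and the cheapest guard is an exact-match overlap check run before any fine-tune job starts. The directory layout and field names below are assumptions; note that hashing only catches verbatim copies, which is why distribution-level measures like the Kernel Divergence Score exist for the subtler cases.

```python
import hashlib
import json
from pathlib import Path

# Illustrative layout of the adjacent buckets described above:
#   data/customer-support-train/*.json   <- intended fine-tune corpus
#   data/customer-support-eval/*.json    <- the golden eval set
# A hurried Path("data").glob("customer-support-*/*.json") matches both.

def fingerprints(files: list[Path], text_key: str = "input") -> set[str]:
    """Hash each example's text so overlap can be checked without copying raw PII around."""
    hashes = set()
    for path in files:
        for record in json.loads(path.read_text()):
            hashes.add(hashlib.sha256(record[text_key].encode()).hexdigest())
    return hashes

def assert_no_contamination(train_files: list[Path], eval_files: list[Path]) -> None:
    overlap = fingerprints(train_files) & fingerprints(eval_files)
    if overlap:
        raise RuntimeError(
            f"{len(overlap)} eval examples appear verbatim in the fine-tune corpus; "
            "refusing to start training."
        )
```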
The benchmark blog post. Marketing wants a story. The applied team produces "How Our Agent Beats GPT-5 on Real-World Customer Support Queries," and the verbatim examples are pulled from the eval set. They are technically anonymized — names changed, account numbers masked — but the underlying structure of the conversation is intact. Anyone who knows the customer can recognize their own conversation with the support team. This has happened to multiple companies, and the typical post-mortem involves a takedown notice and a hard conversation with the customer's privacy office.
The common thread is that none of these are exotic exfiltration scenarios. They are everyday engineering activities — debugging, writing up methodology, training a model, telling a marketing story — that go through the eval set without a control gate, because the team did not build a control gate.
The Discipline That Actually Has to Land
- https://gdprlocal.com/gdpr-machine-learning/
- https://www.techpolicy.press/the-right-to-be-forgotten-is-dead-data-lives-forever-in-ai/
- https://www.edpb.europa.eu/system/files/2025-01/d2-ai-effective-implementation-of-data-subjects-rights_en.pdf
- https://openai.com/policies/data-processing-addendum/
- https://customgpt.ai/data-processing-agreement-ai-vendor/
- https://arize.com/resource/golden-dataset/
- https://www.getmaxim.ai/articles/building-a-golden-dataset-for-ai-evaluation-a-step-by-step-guide/
- https://sigma.ai/llm-privacy-security-phi-pii-best-practices/
- https://www.protecto.ai/blog/llm-privacy-compliance-steps/
- https://langfuse.com/docs/observability/features/masking
- https://www.tonic.ai/guides/llm-data-privacy
- https://arxiv.org/abs/2502.00678
- https://amanpriyanshu.github.io/SynthLeak/
