
The Internal Eval Set Is a Privacy Boundary Nobody Reviewed

11 min read
Tian Pan
Software Engineer

The dataset your AI team calls "the eval set" is, in most companies shipping LLM features, a collection of real customer conversations pulled from production logs. Nobody on the team thinks of it as a privacy event. The data never left the cluster. No new system was provisioned. No vendor was added. An engineer wrote a query, exported a few thousand traces into a labeling tool, and the team started grading model outputs against them. The legal team never heard about it because, from the inside, nothing changed — the same conversations that already lived in the same database were now also being read by a few engineers and a judge model.

That is the privacy boundary nobody reviewed. Customers gave you their messages so you could answer them. They did not give you their messages so you could measure your model against them. The two uses look identical at the storage layer and feel identical at the inference layer, but they are different processing purposes under every modern privacy regime — and the gap between the two is where the next round of compliance pain is going to land.

The reason the boundary is invisible from the engineering side is that the data physically does not move. A row in the conversations table that was written by the chat handler at 2am is read by the eval script at 11am. The same row. The same database. The same access controls. The only thing that changed is the reason someone is reading the data, and "reason" is not a property the engineering stack tracks. Purpose limitation is a legal concept that lives in a column that does not exist in your schema.
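
To make that missing column concrete, here is a minimal sketch of a purpose-gated read path. Everything in it is hypothetical (the AUTHORIZED_PURPOSES registry, the read_rows wrapper, the purpose strings); the point is only that "purpose" becomes a value the stack can check and log, rather than a thought in the reader's head.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical registry: which purposes a dataset may be read for.
AUTHORIZED_PURPOSES = {
    "conversations": {"service_delivery"},  # the purpose on the customer notice
    # "model_evaluation" is deliberately absent until it has a legal basis
}

@dataclass
class AccessRecord:
    dataset: str
    purpose: str
    actor: str
    at: datetime

ACCESS_LOG: list[AccessRecord] = []

def read_rows(dataset: str, purpose: str, actor: str, rows: list) -> list:
    """Gate every read on a declared purpose, and log the declaration."""
    if purpose not in AUTHORIZED_PURPOSES.get(dataset, set()):
        raise PermissionError(f"{dataset!r} is not approved for purpose {purpose!r}")
    ACCESS_LOG.append(AccessRecord(dataset, purpose, actor, datetime.now(timezone.utc)))
    return rows

# The 2am chat handler and the 11am eval script read the same rows,
# but only one purpose appears on the customer's notice:
#   read_rows("conversations", "service_delivery", "chat-handler", rows)   # ok
#   read_rows("conversations", "model_evaluation", "eval-script", rows)    # raises
```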

Purpose limitation is the rule the engineering org never internalized

The principle that personal data collected for one purpose cannot be silently repurposed for another shows up under different names in different regimes: purpose limitation in GDPR, the "specific, explicit, and legitimate purposes" language in most state privacy laws, the FTC's repeated warnings that AI companies must honor their privacy commitments. The engineering implication is the same in every case. The lawful basis on which you collected the customer's data was "provide the service the customer asked for." Using the same data to evaluate, tune, calibrate, or benchmark an AI system is a separate purpose. It needs its own legal basis, and that basis must either be disclosed in the notice the customer saw or be supported by a legitimate-interest claim with a documented balancing test.

Engineers rarely encounter this rule because the data engineering tooling does not surface it. The query planner does not warn you that joining conversations against eval_cases is a purpose change. The vector store does not know that "find similar examples for the few-shot prompt" and "find similar examples for the eval set" are different uses of the same embedding. The trace viewer treats a customer message and a labeled eval example as the same kind of object. The infrastructure flattens a distinction that the law treats as load-bearing.
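
To see how flat that distinction is, compare the two reads side by side. The schema here is invented, but the shape is typical: same table, same grants, and nothing in either statement marks the second one as a change of processing purpose.

```python
# Hypothetical schema; both statements run under the same role and grants.

SERVE_QUERY = """
SELECT body
FROM conversations
WHERE conversation_id = %(cid)s       -- answer this customer's message
"""

EVAL_EXPORT_QUERY = """
SELECT body
FROM conversations
WHERE created_at > now() - interval '30 days'
ORDER BY random()
LIMIT 5000                            -- build this quarter's eval set
"""
```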

The result is the pattern that the legal team eventually walks into: the AI team has spent six months refining a judge against a corpus of "internal traces" that, on inspection, is composed almost entirely of identifiable customer conversations, some of which contain payment details, health information, or material the customer marked as private inside the product. None of those customers consented to their conversations being used as evaluation material, because no one ever asked.

The cluster boundary is not a privacy boundary

The most common defense from the AI team, when this gets surfaced, is some version of "but the data never left the cluster." This argument is a misreading of what the privacy regime is actually regulating. Privacy law does not regulate egress; it regulates processing. Reading the data with a new purpose is processing. Hashing the user_id and copying the row into an evals table is processing. Letting a judge model score the response is processing. Letting an engineer scroll through twenty examples to debug a regression is processing. The fact that none of these operations sent a packet across a VPC boundary is irrelevant to the legal question.
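
The "hash and copy" step is worth spelling out, because it is the one teams most often mistake for anonymization. A sketch, with hypothetical field names:

```python
import hashlib

def pseudonymize(user_id: str, salt: str) -> str:
    # A salted hash is a pseudonym, not anonymization: it is stable,
    # linkable across rows, and recomputable by anyone holding the salt.
    # Under GDPR, pseudonymized data is still personal data, so this
    # copy is still processing, now under a new, undisclosed purpose.
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

def copy_to_evals(row: dict, salt: str) -> dict:
    return {
        "user_ref": pseudonymize(row["user_id"], salt),
        "message": row["body"],           # the message text is untouched,
        "response": row["model_output"],  # and is often identifying by itself
    }
```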

This is also where the SOC 2 and ISO 27001 controls that the company already has give a false sense of coverage. Those controls were designed around access management for the original purpose. They answer the question "is the right person reading this data?" They do not answer the question "is this person reading this data for a purpose the customer authorized?" The eval engineer almost certainly has lawful access to the database under the existing role model. That does not mean the use is lawful.

The internal nature of the dataset also tends to short-circuit the DPIA process. A Data Protection Impact Assessment is supposed to happen before any high-risk processing of personal data begins, and using identifiable customer conversations to evaluate a model is a textbook case of high-risk processing — large volume, special categories possible, automated decision-making downstream. But because nothing felt new from the inside, no DPIA was triggered. The form on the privacy team's intake portal asks "are you adding a new vendor?" and the answer was no.
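
Closing that intake gap is mostly a matter of asking a second question. A sketch of what the trigger logic could look like, with made-up field names:

```python
# Hypothetical DPIA intake triggers. The last three fire even when
# the vendor list has not changed.
DPIA_TRIGGERS = (
    "new_vendor",
    "new_data_category",
    "new_processing_purpose",    # the question the portal never asked
    "automated_decision_making",
)

def needs_dpia(intake: dict) -> bool:
    return any(intake.get(trigger, False) for trigger in DPIA_TRIGGERS)

# needs_dpia({"new_vendor": False, "new_processing_purpose": True})  # -> True
```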

The trigger event is always a customer question, never an audit

The interesting thing about this anti-pattern is how it usually surfaces. It is almost never an audit that catches it. Audits look at vendor lists, data flow diagrams, and consent records. Internal evaluation pipelines are invisible on all three. The triggering event is almost always a customer question: sometimes a data subject access request (DSAR), sometimes an enterprise customer's procurement team asking whether their tenant's conversations are used in model evaluation, sometimes a single curious user emailing support to ask whether the AI is being trained on their messages.
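
Answering any of those questions honestly requires being able to join the eval export back to the person asking. A minimal sketch, reusing the hypothetical pseudonym scheme from the earlier example:

```python
import hashlib

def pseudonymize(user_id: str, salt: str) -> str:
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

def conversations_in_eval_set(user_id: str, eval_rows: list, salt: str) -> list:
    # If the team can recompute the pseudonym and find the rows, the
    # DSAR is answerable, and the export was never anonymous. If it
    # cannot, the DSAR is unanswerable, which is its own compliance problem.
    ref = pseudonymize(user_id, salt)
    return [row for row in eval_rows if row.get("user_ref") == ref]
```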
