When Marketing Reads Your Eval Cases: The Cross-Functional Visibility Problem
The eval set is the most-read artifact your AI team produces, and you almost certainly don't know who's reading it. The repo is private, the CI job is internal, the file is one directory above prompts/ — and yet a sales engineer scraped six cases for a demo last quarter, a marketing analyst pulled three failure cases into a "look how robust our system is" deck, customer success cited eval pass-rates verbatim in a renewal call, and product treats the file as the hidden spec the AI team won't share. The case files are read by more people than the code that generated them, and nobody on the AI team has noticed.
This isn't a permissions failure. The eval set is on the same Git server as the rest of the codebase, with the same access controls as every other engineering artifact. The problem is that the AI team is the only group that treats the eval set as code. Everyone else treats it as documentation, as marketing material, as a product spec, or as a customer complaint log — and each of those readings extracts a different slice of the same file, packages it for a different audience, and ships it somewhere the AI team isn't watching.
The fix isn't to lock the file down harder. The fix is to recognize that the eval set has become a cross-functional artifact whose audience the AI team underestimated, and to build the access and redaction layer before procurement discovers it.
What an eval case actually contains
Strip away the framing and an eval case is four things glued together: an input, an expected behavior, an actual behavior, and a verdict. Each of those four pieces is a leak surface, and the leaks compound when the case is read by someone outside the engineering frame.
The input is almost always a real user message. Even when the team swears the eval set is "synthetic," the inputs that catch real regressions are the ones lifted verbatim from production traces — the awkward phrasing, the partial sentence, the customer name in the salutation, the account number in the body. Synthetic inputs are useful for coverage, but they don't catch the regressions the real ones do, so the eval set drifts toward verbatim production over time.
The expected behavior is, in practice, a small product spec written by an AI engineer. It encodes assumptions about what the feature should do, often in language that hasn't been reviewed by product or marketing. "The assistant should not commit to a specific delivery date" is an eval expectation that, read by a sales engineer, becomes "the product can't quote delivery dates," which becomes a customer-facing limitation that nobody wrote into the product page.
The actual behavior is a model output, which means it contains whatever the model said — including hallucinated competitor names, made-up statistics, and confidently wrong factual claims that the eval exists precisely to flag. Read as a failure case by an engineer, this is the bug being guarded against. Read as a screenshot by a marketing analyst hunting for content, this is "the model said X" — without the "and the eval caught it as wrong" context.
The verdict is the most-misread element. A judge score of 3/5 with the comment "fails to cite source" reads, to an engineer, as a measurable regression that should drive a prompt change. To a customer success manager preparing a renewal deck, the same score reads as "we admit our product fails 40% of the time on citations," which is materially different from what the engineer meant by the same number.
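Concretely, the structure is small enough to sketch. A minimal illustration in Python, with invented field names rather than any particular framework's schema:

```python
# A minimal sketch of the four-part eval case described above.
# Field names are illustrative, not taken from a real eval framework.
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input_text: str     # often a verbatim production trace, names and account numbers included
    expected: str       # a de facto product spec, written casually by an engineer
    actual: str         # raw model output, hallucinations and all
    judge_score: int    # e.g. 3 on a 1-5 scale
    judge_comment: str  # "fails to cite source" reads very differently outside CI
```

Every one of those fields is legible to a non-engineering reader, and none of them carries the context that makes it safe to read.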
The seven readers nobody invited
The AI team's eval repo has more readers than its CODEOWNERS file suggests. Each of them extracts a different artifact, and each extraction creates a downstream obligation the AI team didn't sign up for.
- Sales engineers scrape cases for live-demo material. The inputs are realistic and the model behavior is interesting, so the cases make better demo prompts than anything the demo-content team wrote. The leak: a prospect asks "where did you get this question" and the answer is "from a real customer's eval case."
- Marketing pulls failure cases for content. Counter-intuitively, the failures get more pull than the passes, because "we caught and fixed X" is a better story than "X works." The leak: the failure becomes a public artifact even though the team's stance on the underlying behavior is still being argued internally.
- Customer success cites pass-rates and named cases in renewal conversations. The eval suite's headline numbers leak into procurement comparisons. The leak: customers start asking for the eval methodology, the eval set composition, and the per-customer eval performance — none of which the AI team has built a sharing layer for.
- Legal reads the eval set during IP and PII review. The reading is appropriate, but legal often discovers the eval set late, after customer identifiers have been embedded in dozens of cases and the cleanup is a tax on the team. The leak: surprise compliance debt that lands on the AI team during the worst possible week.
- Product treats the eval set as the spec the AI team won't share. The expected behaviors in the eval cases are the most precise statement of what the feature does that exists anywhere, and product knows it. The leak: feature decisions get made against expectations the AI team wrote casually, without realizing they were authoring product policy.
- Sales engineering during a procurement evaluation is the most dangerous reader. A prospect's RFP includes "share your evaluation methodology and a representative sample of test cases," and someone in sales engineering reaches into the repo and sends the cases they think are most impressive. The leak: forward-looking product plans encoded in eval cases ship to a procurement chat.
- Other AI vendors during a competitive deal are the second-most-dangerous readers. If your sales engineer is sharing cases, your competitor's sales engineer is reading them, and the cases now encode your competitive positioning ("we evaluate against Opus 4.6, not 4.5") and your judgment about your own weaknesses.
The engineering set vs. the shareable set
The right architecture isn't tighter access control on a single eval set. It's two eval sets, with an explicit, versioned pipeline between them.
The engineering eval set is the team's working copy: raw customer traces, internal jargon, unredacted failure cases, competitive references, forward-looking feature names, and the messy diagnostic comments engineers leave for each other. This set is optimized for catching regressions and is consumed by CI, by judges, and by the team's own iteration loop. It should never leave engineering's repo. Treat it the way you'd treat a production database dump — useful internally, dangerous externally.
The shareable eval set is a curated, redacted, narratively clean subset suitable for non-engineering consumption. It's a smaller set — maybe 10–20% of the engineering set — chosen for representativeness rather than coverage. Customer identifiers are stripped, internal feature codenames are replaced with the public-facing names, competitive references are removed, and failure cases are accompanied by the "and here's how the system handles it" context that makes the case readable as a quality story rather than a weakness. This set is what sales, marketing, customer success, and procurement see.
The split looks expensive until you notice that you're already paying for it — just unevenly, on whatever week the procurement team asks for cases and someone scrambles to redact a hundred cases by hand. The shareable set turns that scramble into a maintained artifact. The engineering set stays raw because the people consuming it (engineers, CI, judges) need it raw.
Between the two sets sits a redaction pipeline. The pipeline is mostly mechanical: regex for known customer-identifier patterns, named-entity recognition for personal names and account numbers, a denylist of internal codenames and unreleased feature flags, and a competitive-reference scrub that catches mentions of named competitors and named model versions. Modern PII redaction tooling has gotten good enough that a pipeline run takes seconds per case; the bottleneck isn't the redaction, it's deciding which cases qualify for the shareable set in the first place.
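A minimal sketch of those mechanical steps, assuming a hand-maintained codename map, a competitor-term denylist, and an account-number pattern (all invented for the example). A real pipeline would add a named-entity-recognition pass for personal names; that step is omitted here.

```python
# Sketch of the mechanical redaction steps: regex scrub, codename map,
# competitive-reference denylist. NER for personal names is deliberately
# omitted; a real pipeline would add it (e.g. spaCy or a hosted PII service).
import re

CODENAME_MAP = {"project-kestrel": "Smart Replies"}   # internal -> public name (hypothetical)
COMPETITOR_TERMS = ["Opus 4.6", "Opus 4.5"]           # named models/competitors to scrub
ACCOUNT_RE = re.compile(r"\bACC-\d{6,}\b")            # assumed customer-account pattern
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    text = ACCOUNT_RE.sub("[ACCOUNT]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    for internal, public in CODENAME_MAP.items():
        text = text.replace(internal, public)
    for term in COMPETITOR_TERMS:
        text = text.replace(term, "[MODEL]")
    return text
```

The scrub is the easy half; the curation decision about which cases qualify at all still needs a human owner.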
The request-and-approval workflow
Even with a shareable set, you'll get requests for engineering-set material. Sales has a deal where the prospect wants to see "the actual customer cases you regressed against." Marketing wants a specific failure that happened to a high-profile account because it's a better story than the curated case. Customer success wants the per-customer pass rate broken out for a renewal deck.
These requests can't be handled by access control alone, because the requestor often has a legitimate need and the eval set has a legitimate answer — the question is the framing. The AI team needs to be the framing owner, not the access-control bottleneck.
The workflow that works is a lightweight ticket: requestor names the audience, the use case, and the desired framing; the AI team chooses the case (or composes a representative one), drafts the narrative around it, and either ships the result through the shareable set or marks the request as out-of-policy with a written rationale. This sounds heavy, but in practice the volume is modest, on the order of 5–10 tickets per quarter for most teams, and the cost of not having the workflow is the eval-case leak that lands in a customer's procurement chat.
The workflow also creates an audit trail. If an eval case shows up in a customer-facing artifact, the AI team can trace which ticket authorized it, what redaction level applied, and which narrative framing was sanctioned. Without the workflow, the AI team finds out about the leak on Twitter.
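A sketch of what that ticket could look like as a structured record, so the audit trail exists by construction. All field names here are illustrative, not a prescribed schema:

```python
# Hypothetical ticket record for the request-and-approval workflow.
# Persisting it gives every externally shared case a traceable authorization.
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Outcome(Enum):
    SHIPPED_VIA_SHAREABLE_SET = "shipped"
    OUT_OF_POLICY = "out_of_policy"

@dataclass
class ShareRequest:
    requestor: str
    audience: str                   # e.g. "prospect procurement team"
    use_case: str                   # e.g. "RFP evidence of regression testing"
    desired_framing: str
    case_ids: list[str] = field(default_factory=list)
    redaction_level: str = "shareable"
    outcome: Outcome | None = None
    rationale: str = ""             # required when marked out-of-policy
    decided_on: date | None = None
```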
Versioning and freshness
Eval cases age. A case that flagged a model behavior in Q1 may describe behavior that has since been fixed, changed, or productionized. A sales engineer pulling cases from six months ago is showing prospects a snapshot of a model the team replaced two migrations ago, and the inferences they're drawing about current model quality are wrong.
The shareable set needs explicit versioning: a date, a model version, a prompt version, and a redaction level. When a case is shared externally, it carries those metadata fields with it, and the artifact's freshness becomes legible to its consumers. This protects the team in two directions — old "we fail at X" cases don't haunt the current product narrative, and old "we pass at Y" cases don't get cited as current evidence after the underlying behavior has drifted.
A version-pinning convention also lets the team retire shareable cases when the underlying behavior changes, rather than letting a deck from 2024 cite a case the model couldn't reproduce today.
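One way to encode the convention, with invented version identifiers: each shareable case carries its metadata, and a staleness check retires it when the current model or prompt no longer matches.

```python
# Sketch of shareable-case versioning metadata with a staleness check.
# Version strings are hypothetical placeholders.
from dataclasses import dataclass
from datetime import date

@dataclass
class ShareableCaseMeta:
    published: date
    model_version: str     # e.g. "prod-model-2025-06"
    prompt_version: str    # e.g. "support-prompt-v14"
    redaction_level: str   # e.g. "shareable"

    def is_stale(self, current_model: str, current_prompt: str) -> bool:
        # Retire the case if the behavior it demonstrates came from a model
        # or prompt the team has since replaced.
        return (self.model_version != current_model
                or self.prompt_version != current_prompt)
```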
The architectural realization
Eval sets are a cross-functional artifact whose audience the AI team underestimated. The case files that pass the eval gate are read by sales, marketing, customer success, legal, and product before they're read by the next engineer, and the team that doesn't build the access and redaction layer is going to discover the readership the hard way — in a procurement meeting, a competitor's slide deck, or a customer's compliance review.
The fix is two artifacts with an explicit pipeline between them, a request workflow for the exceptions, and a versioning convention that keeps the shareable artifact legible. The cost is modest. The cost of not doing it is paid in trust calibration with customers, in surprise comms work for sales, and in the slow drip of leaks that the AI team will keep discovering long after each one has stopped being interesting to anyone except the customer who recognized their own complaint.
Treat the eval set the way you treat any other shared artifact with a wider audience than its authors intended: with a published version, a stated audience, and a redaction layer that someone owns by name. The team that publishes a shareable eval set on a cadence — versioned, dated, audience-framed — gets to control the narrative around their AI feature. The team that doesn't gets to read the narrative their sales engineer assembled, in their customer's procurement chat, the week before renewal.
