Shadow Evals: When Private Slices Replace Your Eval Rollup
The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" — and watch them answer yes, all three of them, with three different numbers, against three different slices, on three different laptops, none of which is reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook says you don't have evals. The reality is worse: you have too many evals, each of them privately owned, each of them measuring something real, and none of them rolling up into a single number the org can plan against.
This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.
The pattern doesn't show up because anyone is being lazy. It shows up because the local incentive is so much faster than the global one. An engineer debugging a regression on Spanish-language customer-support queries doesn't want to wait an hour for the full eval suite to spin up against a thousand cases when they can curate fifty cases locally and iterate in three minutes. The fifty cases are correct — they're the cases that matter for the bug at hand. The problem is that nobody else ever sees them, the score against them never lands in the leaderboard, and when the engineer ships the fix the org has no record that those fifty cases ever existed, much less how the new prompt performs against them.
The Telltale Symptoms
You're running shadow evals when the following are all true. PR descriptions cite eval numbers, but the eval that produced them is a script in the author's ~/scratch/ directory. The "official" eval suite — if there is one — gets run once a week by one person, and its dashboard hasn't been refreshed since the last on-call rotation. Engineers reference improvements that happened in private — "yeah, that 3-point lift I got on the financial-services slice last sprint" — but no other engineer can reproduce the number because the slice itself was a CSV that lived on the author's laptop until it got accidentally rm -rf'd in a clean-up.
A subtler symptom: when a regression lands in production, the postmortem cannot identify which eval should have caught it, because nobody knows what was in scope of the official suite. The team has been iterating against private cases that were never promoted into shared infrastructure, so the regression suite has drifted behind the team's actual mental model of "what the product needs to handle." The mental model lives in seven private notebooks. The regression suite is six months stale.
The financial cost is real but secondary. The real cost is epistemic. The org has been making decisions based on stories about evals rather than evals themselves, and the gap between the story and the suite gets wider every sprint.
Why the Local Incentive Wins
A shared eval harness, run correctly, is slow. It includes cases the engineer doesn't care about for the current bug, it spends compute on judge runs that tell the engineer nothing about the question at hand, and it produces a number that folds the relevant slice into a global average, obscuring the change the engineer is trying to measure. The engineer's private fifty-case notebook is twenty times faster and produces a sharper signal on the actual question. From a pure time-to-iteration standpoint, the private notebook is the rational choice for the engineer's current task.
This is the core trap. The private eval is locally optimal — better signal-to-noise on the exact failure mode the engineer is debugging — and globally pessimal — invisible to the rest of the org, not promoted into the shared baseline, lost the moment the engineer switches contexts. Teams that don't actively pay the cost of promoting private slices into shared infrastructure end up with a graveyard of locally-correct decisions and no global picture.
The version-control analog is unowned scripts in everyone's home directory. The codebase analog is dead branches that never merge. The data-engineering analog is the Excel spreadsheet that runs the business. They all describe the same failure mode: real work is happening, real value is being produced, none of it accrues to the shared substrate, and the org pays for that drift in slow motion.
The Discipline That Closes the Gap
Closing the shadow-eval gap is not a tooling problem first — it is a contract problem. The contract is: every eval result that influenced a merge decision must be reproducible by another engineer on another machine using only what's in the repository. That sentence does almost all the work. If you take it seriously, several things follow:
A shared harness with a leaderboard. One process that runs the canonical suite against a registered model+prompt+config tuple and writes the result to a place everyone can see. Tools like promptfoo, Braintrust, the lm-evaluation-harness pattern, and homegrown variants all converge on this shape — a YAML or config file that specifies dataset, prompt template, judge, and scoring rubric, version-pinned so that the same config plus the same commit hash produces the same number. The number is not the goal; reproducibility is. The leaderboard makes the global state legible at a glance: who moved which slice in which direction this week.
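A minimal sketch of what that registered tuple might look like in a homegrown harness; every field name, path, and placeholder value here is illustrative, not promptfoo's or Braintrust's actual schema.
```python
# Illustrative only: a homegrown registry entry. The point is the
# version-pinned tuple, not this particular shape.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    name: str              # canonical suite or slice name
    dataset: str           # path in the repo, not a laptop CSV
    dataset_revision: str  # content hash, so "same data" is checkable
    prompt_template: str   # versioned template path
    model: str             # exact model identifier under test
    judge: str             # judge model or rubric used for scoring
    rubric_version: str    # scoring rubric, also pinned

CANONICAL = EvalConfig(
    name="support-canonical",
    dataset="evals/slices/es_long_tail/cases.jsonl",
    dataset_revision="sha256:<content-hash>",
    prompt_template="prompts/support_v12.txt",
    model="<provider>/<model-id>",
    judge="<judge-model-id>",
    rubric_version="rubrics/support_quality_v3.md",
)
# Same config + same commit hash should produce the same number.
```
Whether this lives in YAML, a dataclass, or a vendor's config format matters less than the pinning: change any field and you have, by definition, a different eval.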
Mandatory PR-attached eval runs. When an engineer opens a PR that touches a prompt, a tool definition, a model identifier, or anything else that could move quality, CI runs the shared eval suite against the proposed change and posts the diff against the current baseline. The diff is a merge gate. The merge gate cannot be argued with by a screenshot from someone's laptop, because the screenshot isn't reproducible. This is the policy lever that actually disciplines behavior — engineers stop running private evals as their primary signal because the gate keeps catching things their private slice missed.
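A sketch of that CI step under a few assumptions: the harness can be invoked as a module, it writes per-slice scores to JSON, and the baseline from main is committed in the repo. The entry point, paths, and noise tolerance are placeholders, not any real tool's interface.
```python
# ci_eval_diff.py -- illustrative PR gate step; all paths and the harness
# entry point are stand-ins for whatever your suite actually exposes.
import json
import subprocess
import sys

HARNESS = ["python", "-m", "evals.run", "--config", "evals/canonical.yaml"]

def run_candidate() -> dict:
    """Run the canonical suite against the PR's prompt/config and load scores."""
    subprocess.run(HARNESS + ["--out", "candidate.json"], check=True)
    with open("candidate.json") as f:
        return json.load(f)

def main() -> int:
    with open("evals/baseline.json") as f:   # last result committed on main
        baseline = json.load(f)
    candidate = run_candidate()

    lines = ["| slice | baseline | candidate | diff |", "|---|---|---|---|"]
    regressed = []
    for slice_name, base_score in baseline["slices"].items():
        cand_score = candidate["slices"][slice_name]
        diff = cand_score - base_score
        lines.append(f"| {slice_name} | {base_score:.3f} | {cand_score:.3f} | {diff:+.3f} |")
        if diff < -0.005:                    # illustrative noise tolerance
            regressed.append(slice_name)

    print("\n".join(lines))                  # CI posts this table as the PR comment
    if regressed:
        print(f"Regressed slices: {', '.join(regressed)}", file=sys.stderr)
        return 1                             # nonzero blocks the merge
    return 0

if __name__ == "__main__":
    sys.exit(main())
```
The exit code is what makes this a gate rather than a dashboard: CI posts the table either way, but a regressed slice stops the merge.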
Blessed slices that everyone reproduces. The eval suite is not a monolith. It is a small number of named slices — the long-tail Spanish queries, the multi-turn refusal cases, the new-feature regression set, the financial-services rubric — each owned by a named human, each promoted from what used to be a private notebook, each versioned and visible. When an engineer discovers a new failure mode in the wild, the promotion path is what matters: the case gets added to a blessed slice, the slice owner reviews it, the leaderboard now reflects the new ground truth, and the rest of the team starts iterating against it within a day. Without a promotion path, every private case is a private case forever.
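The promotion path can be as small as a helper that appends the case to the slice's file with a pending-review flag. The layout, field names, and the example values in the usage comment below (slice name, incident id, author) are hypothetical.
```python
# promote_case.py -- sketch of "found in the wild" -> "blessed slice".
import json
from datetime import date
from pathlib import Path

def promote(case: dict, slice_name: str, failure_mode: str,
            source: str, added_by: str) -> None:
    """Append a case to a blessed slice, flagged for owner review.

    `source` is the incident or PR that surfaced the failure; the slice
    owner flips `status` to "active" before the case counts toward the gate.
    """
    record = {
        **case,                       # input, expected behavior, scoring hints
        "failure_mode": failure_mode,
        "source": source,
        "added_by": added_by,
        "added": date.today().isoformat(),
        "status": "pending_owner_review",
    }
    path = Path("evals/slices") / slice_name / "cases.jsonl"
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage, the day the failure is found:
# promote({"id": "case-0412", "input": "...", "expected": "refuses politely"},
#         slice_name="multi_turn_refusal",
#         failure_mode="refusal lost after turn 3",
#         source="INC-214", added_by="maria")
```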
Provenance metadata on every case. Each case in a blessed slice carries a short note: what failure mode it guards, what incident or PR added it, who owns it. This sounds bureaucratic until the eval suite has ten thousand cases and someone wants to know whether a particular regression is allowed to land. With provenance, you can answer that question in thirty seconds. Without it, you do a six-week archaeology project against Git blame and an old Slack channel.
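Given fields like those, the thirty-second answer is a lookup rather than an archaeology project. The file layout and field names below carry over from the previous sketch and remain assumptions.
```python
# case_provenance.py -- answer "why does this case exist, and who decides
# whether a regression on it can land?" from the metadata it carries.
import json
from pathlib import Path

SLICES = Path("evals/slices")

def provenance(slice_name: str, case_id: str) -> str:
    owner = json.loads((SLICES / slice_name / "owner.json").read_text())["owner"]
    for line in (SLICES / slice_name / "cases.jsonl").read_text().splitlines():
        case = json.loads(line)
        if case.get("id") == case_id:
            return (f"{case_id}: guards '{case['failure_mode']}', "
                    f"added via {case['source']} by {case['added_by']}; "
                    f"regressions need sign-off from {owner}")
    return f"{case_id}: not found in {slice_name}"
```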
The Half-Step That Doesn't Work
The most common attempt at fixing shadow evals fails because it stops short. A team will adopt a shared eval harness, set up a leaderboard, and then not gate PRs on it. The engineers keep their private notebooks because the private notebook is still faster, the shared harness is still optional, and the only thing that changed is that there's now an unmaintained dashboard the team feels guilty about. Six months later the dashboard's last update is from the launch week and the team is still running shadow evals — they just have a tool they're not using.
The gate is what disciplines behavior, not the tool. Tools without gates are a tax on the people who use them and an alibi for the people who don't. If you're not willing to block merges on the shared eval, you're not actually running it; you're staffing it.
A related half-step: gating PRs on a single shared eval number instead of the per-slice breakdown. This waves through changes that hurt one slice as long as they improve the global average, which is exactly the kind of trade-off engineers should be making explicitly, with sign-off from the slice owner, not silently by virtue of the math. The PR comment should show the diff per slice and require an owner sign-off if any slice regresses, not just print a single composite score.
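One way to make that sign-off machine-checkable is to layer it on the per-slice gate sketched earlier. How an approval reaches CI (a PR label, a committed file, an environment variable) is an implementation choice; the environment-variable form below is only an illustration, and the file layout is the same assumption as before.
```python
# slice_signoff.py -- sketch: a regressed slice only passes the gate when its
# named owner has explicitly approved the trade-off. Formats are assumptions.
import json
import os
from pathlib import Path

def slice_owner(slice_name: str) -> str:
    """Each blessed slice declares exactly one owner in its metadata file."""
    meta = Path("evals/slices") / slice_name / "owner.json"
    return json.loads(meta.read_text())["owner"]

def approved(slice_name: str) -> bool:
    """CI passes recorded approvals as 'owner:slice' pairs in this sketch."""
    approvals = os.environ.get("SLICE_REGRESSION_APPROVALS", "").split(",")
    return f"{slice_owner(slice_name)}:{slice_name}" in approvals

def gate(regressed_slices: list[str]) -> int:
    blocked = [s for s in regressed_slices if not approved(s)]
    for s in blocked:
        print(f"{s} regressed and has no sign-off from {slice_owner(s)}")
    return 1 if blocked else 0   # nonzero blocks the merge
```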
What Good Looks Like Six Months In
A team that has actually closed the shadow-eval gap looks different in three observable ways.
First, when you ask an engineer why they made a particular prompt change, they cite the leaderboard URL and the slice that moved. They do not screenshot a notebook. The answer is reproducible by you on your laptop in the next ten minutes.
Second, when a regression lands, the postmortem identifies the eval gap precisely: "this failure mode was not covered by any blessed slice; we're adding it to the multi-turn-refusal slice and the slice owner has signed off." The team treats eval coverage as a first-class kind of technical debt, with the same weight as missing unit tests in a service-engineering team.
Third, the leaderboard is boring. Most weeks it moves by fractions of a point. The boring is the signal — it means the team has stopped relying on dramatic private wins and started shipping incremental shared progress that compounds. The boring leaderboard is what a healthy AI engineering org looks like. The dramatic private win, screenshotted in Slack, is what an unhealthy one feels like from the inside, which is exactly why teams stay in the trap as long as they do.
The shift from private notebooks to shared rollup is not a tooling upgrade; it is a change in what counts as evidence inside the team. Until eval numbers from a private notebook are inadmissible — politely, but firmly — engineers will keep running them, because the local incentive will keep being faster than the global one. The discipline is the merge gate, the blessed slice, the named owner, and the leaderboard URL that anybody can rerun. The tools are the easy part. The contract is the hard part, and the contract is what actually closes the gap.
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://newsletter.pragmaticengineer.com/p/evals
- https://www.braintrust.dev/articles/eval-driven-development
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://github.com/EleutherAI/lm-evaluation-harness
- https://www.promptfoo.dev/docs/integrations/ci-cd/
- https://www.statsig.com/perspectives/shadow-testing-ai-model-evaluation
- https://latitude.so/blog/ultimate-ci-cd-llm-evaluation-guide
- https://www.evidentlyai.com/blog/llm-unit-testing-ci-cd-github-actions
