Shadow Evals: When Private Slices Replace Your Eval Rollup
The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" and watch all three answer yes, with three different numbers, against three different slices, on three different laptops, none of them reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook failure is having no evals at all. The reality is worse: you have too many evals, each privately owned, each measuring something real, and none rolling up into a single number the org can plan against.
This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.
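To make the "no signal" point concrete, here is a minimal sketch of why pass rates from private slices cannot be compared, while a delta against a baseline on a shared example set can. The function names, example ids, and results are all hypothetical, invented for illustration; this is not a description of any particular eval framework.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of examples that passed; results maps example id -> pass/fail."""
    return sum(results.values()) / len(results)


def delta_vs_baseline(baseline: dict[str, bool], candidate: dict[str, bool]) -> float:
    """Pass-rate change, defined only when both runs scored the same example ids."""
    if baseline.keys() != candidate.keys():
        raise ValueError("runs scored different example sets; the delta is meaningless")
    return pass_rate(candidate) - pass_rate(baseline)


# Hypothetical shared set: the baseline and the candidate score the same four ids.
shared_baseline = {"q1": True, "q2": False, "q3": True, "q4": False}
shared_candidate = {"q1": True, "q2": True, "q3": True, "q4": False}

# Hypothetical private slice: a perfect pass rate, but on ids nobody else ran.
private_slice = {"a1": True, "a2": True}

print(delta_vs_baseline(shared_baseline, shared_candidate))

try:
    delta_vs_baseline(shared_baseline, private_slice)
except ValueError as err:
    print(f"not comparable: {err}")
```

The private slice scores 100% and still says nothing about movement against the baseline, because the two runs never scored the same examples. The screenshot in the PR is a real number; it just has no baseline to be a delta against.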
