The Synthetic Eval Your Real Users Never Resemble

10 min read
Tian Pan
Software Engineer

There is a class of eval failure that no dashboard catches because it shows up as success. The score climbs week over week. The judge agrees with the answer. The regression tests stay green. Meanwhile, the support team is logging a slow drift in user-reported quality, sales is hearing "it doesn't quite get what I meant," and nobody in engineering can reproduce the complaint because every example anyone tries on the eval set passes. The eval and the users live in different distributions, and the eval is the more polished of the two.

The mechanism is simple, and it hides in plain sight: the model that wrote your eval prompts and the model under test are siblings, and siblings share priors. They smooth the same edges, prefer the same phrasings, and leave out the same kinds of malformed input. The eval certifies behavior in the world the generator imagined your users inhabit. Your actual users live somewhere else.
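One rough way to make that gap visible is to compare crude surface statistics of the eval prompts against a sample of real user prompts. The sketch below is illustrative only: the function names `surface_profile` and `profile_gap`, the three features, and the toy strings are all assumptions invented for this example, not anything taken from a real eval set or traffic log. It also assumes both lists are non-empty.

```python
from statistics import mean


def surface_profile(prompts):
    """Return a few crude surface statistics for a non-empty list of prompts.

    The features are hypothetical stand-ins for "polish": length,
    terminal punctuation, and whether the prompt starts in lowercase.
    """
    return {
        "mean_words": mean(len(p.split()) for p in prompts),
        "frac_no_end_punct": mean(
            0.0 if p.rstrip().endswith((".", "?", "!")) else 1.0
            for p in prompts
        ),
        "frac_lowercase_start": mean(
            1.0 if p.lstrip()[:1].islower() else 0.0 for p in prompts
        ),
    }


def profile_gap(eval_prompts, user_prompts):
    """Per-feature difference: user profile minus eval profile."""
    eval_profile = surface_profile(eval_prompts)
    user_profile = surface_profile(user_prompts)
    return {key: user_profile[key] - eval_profile[key] for key in eval_profile}


# Invented toy strings, for illustration only.
synthetic_eval = [
    "Please summarize the attached quarterly report in three bullet points.",
    "What is the difference between a list and a tuple in Python?",
]
real_users = [
    "summarize this",
    "list vs tuple?? which 1",
    "why doesnt it work",
]

gap = profile_gap(synthetic_eval, real_users)
for feature, delta in gap.items():
    print(f"{feature}: {delta:+.2f}")
```

A large gap on features like these does not prove the eval is wrong, and a small gap does not prove it is representative. It is only a cheap signal that the two sets may not come from the same distribution, which is the situation described above.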