The Eval Set Your Prompt Engineers Turned Into Production Few-Shots
The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.
The forensics were quick once someone thought to run them. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples (the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer), and over a few sprints they had copy-pasted the strongest of them into the production system prompt as few-shot demonstrations. The dashboard kept going up because the model was being graded on inputs that sat verbatim in its own prompt at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.
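
The check that would have caught this is small. Below is a minimal sketch, assuming the eval inputs are available as plain strings and the production system prompt as a single string; the names `normalize` and `find_leaked_examples` are hypothetical, not anything the team in this story actually ran. It only catches verbatim copies (modulo case and whitespace), so a paraphrased few-shot would slip through.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaked_examples(eval_inputs: list[str], system_prompt: str) -> list[int]:
    """Return indices of eval inputs that appear verbatim (after normalization) in the prompt.

    Hypothetical helper: exact-substring matching only, no fuzzy or semantic matching.
    """
    prompt = normalize(system_prompt)
    leaked = []
    for index, text in enumerate(eval_inputs):
        needle = normalize(text)
        # Skip empty inputs, which would trivially match any prompt.
        if needle and needle in prompt:
            leaked.append(index)
    return leaked


if __name__ == "__main__":
    # Illustrative data only; not taken from any real eval set.
    eval_inputs = [
        "How do I reset my password?",
        "Why was I charged twice?",
    ]
    system_prompt = (
        "You are a support agent.\n\n"
        "Example:\nQ: How do I  reset my\npassword?\nA: ..."
    )
    print(find_leaked_examples(eval_inputs, system_prompt))
```

Run on every prompt change, a check like this turns "nobody owned the boundary" into a failing build, which is at least somebody's problem.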
