The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want
Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.
Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction is down 12% over the last quarter, and the decline lines up almost perfectly with the stretch where your automated eval metrics were climbing fastest.
This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value, and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.
