Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests
Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.
That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.
