Skip to main content

One post tagged with "capacity"

View all tags

Your Eval Suite Is a Production Workload: When Nightly Tests Starve Live Traffic

· 11 min read
Tian Pan
Software Engineer

A team's most successful AI feature went dark at 2:14 AM on a Tuesday. The pager said the model API was returning 429s in steady state. The model was healthy. The provider was healthy. The team's own production traffic was nominal. What was eating the quota was the nightly eval suite — the same suite the team had been proudly expanding the previous week. The eval and the product shared an organization key, and on that night the eval was the noisy neighbor that broke its own roommate.

The eval wasn't misbehaving. It was doing exactly what its authors designed: a thousand cases against the production model identifier, on a cadence, on a schedule everyone had forgotten about because it had been quiet for two years. The expansion that finally pushed it over the limit added three hundred cases. The PR was reviewed by the eval owner and the prompt owner. Nobody on the review thread thought to ask: how much of the daily token quota does this consume?