
Your Eval Suite Is a Production Workload: When Nightly Tests Starve Live Traffic

· 11 min read
Tian Pan
Software Engineer

A team's most successful AI feature went dark at 2:14 AM on a Tuesday. The pager said the model API was returning 429s in steady state. The model was healthy. The provider was healthy. The team's own production traffic was nominal. What was eating the quota was the nightly eval suite — the same suite the team had been proudly expanding the previous week. The eval and the product shared an organization key, and on that night the eval was the noisy neighbor that broke its own roommate.

The eval wasn't misbehaving. It was doing exactly what its authors designed: a thousand cases against the production model identifier, on a cadence, on a schedule everyone had forgotten about because it had been quiet for two years. The expansion that finally pushed it over the limit added three hundred cases. The PR was reviewed by the eval owner and the prompt owner. Nobody on the review thread thought to ask: how much of the daily token quota does this consume?

That question — what fraction of the production quota does this offline workload consume? — is the question that distinguishes teams who treat eval as a side activity from teams who treat eval as production. The former end up writing post-mortems. The latter end up writing dashboards.

Eval is not a unit test

In most engineering cultures, the word "test" pre-frames the resource model. A unit test runs in CI on a hosted runner whose CPU is bought-and-paid-for by your platform team. The marginal cost of one more assertion is the rounding error on a CI invoice. Engineers learn early to write tests with abandon, because the resource the test consumes is fungible and abundant.

An LLM eval inherits the word "test" but none of the resource model. The marginal cost of one more eval case is a real line on the provider invoice and, crucially, it draws against the same quota pool as the product that pays for the company. Provider rate limits are typically enforced per organization, not per key. Creating a second API key under the same org does not double your tokens-per-minute (TPM) limit; both keys draw from the same pool. The engineer who mentally models eval cost as "free, like a unit test" is making a category error that the billing system politely declines to correct.

The quota model is multi-dimensional. Providers commonly enforce four independent dimensions simultaneously: requests per minute (RPM), tokens per minute (TPM), requests per day (RPD), and tokens per day (TPD). A workload that fits comfortably under TPM can still breach RPD or TPD. A nightly eval that runs in a fifteen-minute window is a short, sharp spike against the per-day budget, which is exactly the dimension hardest to recover from: a per-minute bucket refills within a minute, while a rolling daily window can take hours to free up capacity.
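To make the four dimensions concrete, here is a minimal sketch that checks one batch against all of them at once. The `Limits` class, the `breached_dimensions` function, and every number below are hypothetical, chosen only for illustration. The sketch also assumes the batch is the only consumer of the pool and is spread evenly over its run window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    rpm: int  # requests per minute
    tpm: int  # tokens per minute
    rpd: int  # requests per day
    tpd: int  # tokens per day


def breached_dimensions(limits, requests, tokens, minutes):
    """Return the limit dimensions a batch would exceed.

    Simplification: the batch is spread evenly over `minutes` and no other
    workload is drawing from the same pool.
    """
    breaches = []
    if requests / minutes > limits.rpm:
        breaches.append("rpm")
    if tokens / minutes > limits.tpm:
        breaches.append("tpm")
    if requests > limits.rpd:
        breaches.append("rpd")
    if tokens > limits.tpd:
        breaches.append("tpd")
    return breaches


# Hypothetical tier limits and a hypothetical 1,300-case eval at 2,000 tokens per case.
limits = Limits(rpm=500, tpm=200_000, rpd=10_000, tpd=2_000_000)
result = breached_dimensions(limits, requests=1_300, tokens=2_600_000, minutes=15)
print(result)
```

With these made-up numbers the batch stays under both per-minute limits and under the daily request limit, yet it overshoots the daily token limit on its own. That is the "fits under TPM, breaches the daily budget" shape described above.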

The shape of the failure

A typical incident timeline looks like this:

  • 02:14 AM: eval worker fires its first case in a 1,000-case batch.
  • 02:14:08 AM: eval has burned 1.8% of daily tokens in eight seconds.
  • 02:18 AM: eval reaches case 280 and the production workload's organic 2 AM traffic — small but non-zero — joins the contention.
  • 02:21 AM: combined usage breaches the daily token ceiling.
  • 02:21 AM to 04:00 AM: every synchronous production request returns 429. The eval itself starts getting 429s too, but the eval's retry loop with exponential backoff continues to consume RPM budget even though TPD is exhausted.
  • 04:00 AM: the rolling window edge advances enough for retries to land.
  • 09:00 AM: an on-call engineer sees the dashboard, traces the cause to the eval, and updates the post-mortem doc with a sentence that begins "the eval suite ran as scheduled and."
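The retry behavior in the timeline is worth a sketch of its own. The version below assumes the caller can tell which window a 429 refers to; how to determine that differs by provider, so the `RateLimited` exception and its `scope` attribute are hypothetical stand-ins, as are `DailyBudgetExhausted` and `call_with_backoff`. The idea is that a per-minute 429 is worth backing off for, while a per-day 429 is a signal to stop, because retrying only burns request budget.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical wrapper for a 429 response.

    `scope` is assumed to be derivable from the provider's response:
    "minute" for a per-minute limit, "day" for a per-day limit.
    """

    def __init__(self, scope):
        super().__init__(f"rate limited ({scope})")
        self.scope = scope


class DailyBudgetExhausted(Exception):
    """Raised to tell the eval scheduler to stop for the night."""


def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry per-minute rate limits with jittered backoff; never retry daily ones."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as err:
            if err.scope == "day":
                # Retrying cannot succeed until the daily window moves on.
                raise DailyBudgetExhausted("daily budget spent; stop retrying") from err
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

A scheduler that catches `DailyBudgetExhausted` can abandon the remaining cases instead of hammering the API until the window reopens.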

The post-mortem then asks the question that should have been asked at design time: why is a non-production workload allowed to consume production capacity? And the answer is always some version of: the team that owns eval correctness and the team that owns provider quota are different, and neither team's review caught the cross-cutting constraint.

What "isolating eval" actually means

The fix is structurally simple and operationally underrated: eval gets its own provider account, or at minimum its own organization, with its own quota pool. The eval account's bad day cannot become the production account's bad day because they are different accounts on the provider's billing system. This is the move that vendor-neutral gateway tools have been quietly making standard practice, and it is the move that teams running directly against a provider API tend to skip because nobody's runbook tells them to.

Concretely, "isolating eval" decomposes into several decisions, and the decisions are not all the same kind:

  • Account-level isolation. A separate provider organization for eval. This is the strongest form, because provider rate limits are enforced at the organization tier. The cost is operational: another billing relationship, another seat to manage, another invoice to reconcile. The benefit is that a runaway eval cannot starve production, full stop.
  • Gateway-level isolation. A virtual key issued by an internal LLM gateway, with a budget cap and a rate cap independent of the production virtual key. The eval workload speaks to the same provider organization underneath, but the gateway throttles it before the provider does. This is cheaper to set up than account-level isolation but leaves a residual risk: at the provider tier, the two workloads still share a pool, and any pool-level limit applies to the sum.
  • Code-level isolation. A worker-pool concurrency cap on the eval, with backpressure into the eval scheduler. This is the weakest form. It assumes the eval author correctly predicts the eval's worst-case quota draw, which is exactly the prediction the original incident proved unreliable.
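For the code-level form, a minimal sketch of a concurrency cap with backpressure might look like the following. The function name `run_eval_cases` and the `run_case` callable are hypothetical; `run_case` stands in for whatever sends one eval case to the model.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_eval_cases(cases, run_case, max_in_flight=4):
    """Run eval cases with at most `max_in_flight` requests outstanding.

    The semaphore blocks the submitting loop, so the scheduler feels
    backpressure instead of queueing every case up front.
    """
    slots = threading.BoundedSemaphore(max_in_flight)

    def guarded(case):
        try:
            return run_case(case)
        finally:
            slots.release()  # free the slot even if the case raised

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = []
        for case in cases:
            slots.acquire()  # blocks while max_in_flight cases are running
            futures.append(pool.submit(guarded, case))
        # Results come back in submission order.
        return [future.result() for future in futures]
```

Note what this does not do: it bounds how many requests are in flight, not how many tokens the run consumes in total. That gap is why this form is the weakest of the three.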

The strongest form is the cheapest one to reason about and the most expensive one to set up. Most teams stop at the middle option and call it done, which is reasonable as long as the team understands what the gateway cannot protect them from.

The PR review question nobody is writing

When a PR adds three hundred eval cases, what is the structural reason the review thread doesn't ask about quota impact? In most organizations, eval lives in a /evals directory and the diff looks like data — YAML or JSON test cases, sometimes a small change to a runner script. Code review for data files is shallow by reflex. The reviewer is looking for typos, format errors, maybe a duplicate label. The reviewer is not running back-of-envelope math on per-day token consumption.

A useful intervention is to make the math automatic. The eval framework can emit, at the end of every run, a single line of telemetry: this eval cost N tokens; the production daily quota is M; this eval consumes N/M of daily quota. When that number crosses a threshold — 5% is a defensible starting point for a workload that runs once a day, lower for workloads that run more often — the PR is required to obtain a sign-off from whoever owns provider quota. The discipline closes the cross-team gap with a number rather than with a meeting.

The threshold matters less than the existence of a numeric gate. Teams that gate on "the eval owner thinks this is fine" will eventually ship the expansion that breaks production. Teams that gate on "the projected daily quota draw is below 5%" will not, because the number stops being a matter of opinion.
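The telemetry line and the gate can be very small. The sketch below is one possible shape, not a reference implementation: the function name `quota_report`, the field names in the emitted line, and the example numbers are all hypothetical, and the 5% default mirrors the starting point suggested above.

```python
def quota_report(eval_tokens_per_run, daily_quota_tokens, runs_per_day=1, threshold=0.05):
    """Return (needs_signoff, telemetry_line) for an eval's projected daily draw."""
    if daily_quota_tokens <= 0:
        raise ValueError("daily_quota_tokens must be positive")
    projected = eval_tokens_per_run * runs_per_day
    fraction = projected / daily_quota_tokens
    needs_signoff = fraction > threshold
    line = (
        f"eval_tokens_per_day={projected} "
        f"prod_daily_quota={daily_quota_tokens} "
        f"quota_fraction={fraction:.1%} "
        f"needs_quota_signoff={needs_signoff}"
    )
    return needs_signoff, line


# Hypothetical numbers: a 2.6M-token eval against a 40M-token daily quota.
needs_signoff, line = quota_report(2_600_000, 40_000_000)
print(line)
```

In CI, the boolean can fail the check or request a review from the quota owner, and the printed line gives the reviewer the number rather than an opinion.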

Eval scheduling is a capacity decision

Even with isolated quota, the "when" of an eval matters. An eval that runs at 2 AM is a polite eval, because for most products 2 AM in the main user time zone is close to the daily traffic low. An eval that runs at 2 PM is a hostile eval: wherever the two workloads still share a per-minute bucket, as they do under gateway-level isolation, 2 PM is when organic traffic competes for it, even if the daily budget is fine.

Scheduling is not just about latency politeness; it is about failure-mode locality. If the eval and the product share any capacity dimension, the time-of-day choice determines whether eval failures correlate with product failures or anti-correlate with them. Most teams want anti-correlation: eval should hammer the provider when production is quiet, so that production's peak-hour quota draw never lands on top of an eval spike.

A defensible scheduling policy has three pieces:

  1. Eval runs in the lowest-traffic window for the product's busiest time zone. This is the cheapest and best-understood lever.
  2. Eval pauses when production quota utilization exceeds a ceiling. A circuit breaker on the eval scheduler checks the production gateway's recent utilization and defers the next batch if the product is consuming more than, say, 70% of TPM. This costs the eval some completeness (a few nights a quarter the eval will run short), but in exchange no eval expansion can land on top of a production traffic surge.
  3. Eval rate-limits itself even when it isn't competing. A self-imposed RPM cap on the eval worker, well below the provider tier limit, so that a bug in the eval (an accidentally repeated batch, a misconfigured loop) cannot draw infinitely from the quota pool before someone notices.

The third item often draws pushback because it slows the eval down. The eval owner wants the result by morning; a self-imposed cap might stretch the run from fifteen minutes to forty. The trade is straightforward: a slower eval that cannot blow up the budget is strictly cheaper than a fast eval that can. Provider invoices are paid in dollars; eval latency is paid in waiting.
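The second and third pieces of the policy can be sketched together. Everything named here is hypothetical: `MinIntervalLimiter` is a deliberately simple self-imposed cap that spaces requests evenly, `prod_utilization` stands in for whatever the production gateway exposes as recent TPM utilization (a fraction between 0 and 1), and `send_case` stands in for the call that runs one eval case.

```python
import time


class MinIntervalLimiter:
    """Self-imposed cap: at most `rpm` requests per minute, evenly spaced."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def run_eval(batches, send_case, prod_utilization, limiter, ceiling=0.70):
    """Run eval batches, deferring any batch that starts while production is busy.

    Returns (results, deferred_batches). Deferred batches are the cost in
    completeness: the eval runs short instead of competing with production.
    """
    results, deferred = [], []
    for batch in batches:
        if prod_utilization() > ceiling:  # circuit breaker, checked per batch
            deferred.append(batch)
            continue
        for case in batch:
            limiter.wait()  # cap applies even when nothing is competing
            results.append(send_case(case))
    return results, deferred
```

The breaker is checked once per batch rather than once per case, which keeps the check cheap but means a batch that has already started will finish. Smaller batches tighten that window.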

The alert that has to exist

Independent of all the prevention above, one piece of observability is non-negotiable: an alert that fires when any non-production workload consumes more than a configurable fraction of the shared quota. The alert sits on the gateway or on the provider's usage dashboard, segments traffic by virtual key or by API key, and fires on a rolling window — not just on the daily total.

The alert is the last-resort safety net. The earlier preventions try to make sure the situation never arises. The alert assumes the preventions will eventually fail — a new workload gets added without going through the review, a virtual key gets misconfigured, a developer copy-pastes production credentials into a one-off script — and gives the on-call a chance to catch it before customers do.

The alert thresholds should be stricter than the provider's own thresholds. If the provider rate-limits at 100% of TPM, the alert should fire at 70%. The point of the alert is to give human attention time to react before the provider stops being polite about it.
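A rolling-window check of this kind can be sketched in a few lines. The class name `RollingQuotaAlert`, the 10% non-production share, and the key names in the usage note are hypothetical; the 70% ceiling follows the example above. The sketch assumes usage events arrive in time order with timestamps in seconds, and that `window_quota_tokens` is the provider limit for the same window length.

```python
from collections import deque


class RollingQuotaAlert:
    """Tracks token usage per key over a rolling window (timestamps in seconds)."""

    def __init__(self, window_seconds, window_quota_tokens,
                 non_prod_share=0.10, total_ceiling=0.70):
        self.window_seconds = window_seconds
        self.window_quota_tokens = window_quota_tokens
        self.non_prod_share = non_prod_share
        self.total_ceiling = total_ceiling
        self.events = deque()  # (timestamp, key, tokens), appended in time order

    def record(self, timestamp, key, tokens):
        self.events.append((timestamp, key, tokens))

    def check(self, now, production_keys):
        # Drop events that have aged out of the rolling window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        usage = {}
        for _, key, tokens in self.events:
            usage[key] = usage.get(key, 0) + tokens
        quota = self.window_quota_tokens
        offenders = sorted(
            key for key, used in usage.items()
            if key not in production_keys and used / quota > self.non_prod_share
        )
        return {
            "non_prod_over_share": offenders,
            "total_over_ceiling": sum(usage.values()) / quota > self.total_ceiling,
        }
```

A monitor would call `record` for each usage event reported by the gateway and call `check` on a timer, paging when either field is non-empty or true. For example, with a 60-second window and a 100,000-token quota, a key named `"eval"` that has used 15,000 tokens in the window would be listed under `non_prod_over_share`.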

The architectural realization

An eval suite that calls a paid API is a production workload whose SLO is "must not break the system it grades." A team that has not named that SLO out loud is one eval expansion away from a self-inflicted outage, where the post-mortem will read like a comedy: "the engineering team's quality-assurance process consumed all the quality, leaving none for the customers."

The deeper realization is that "production" was never about whether code ran in a particular environment or whether traffic came from particular users. Production is the set of workloads whose failure customers experience. An eval suite that competes for production quota is, by that definition, production — its outages are production outages, just routed through a less obvious mechanism.

The teams that understand this stop reviewing eval changes the way they review unit-test changes. They start reviewing eval changes the way they review capacity changes: with a budget, a threshold, a sign-off, and a kill switch. The eval becomes a peer to the workload it grades, instead of a guest sharing the room without paying rent.
