
Your Latency SLO Is a Function of Other Teams' Prompt Sizes

· 10 min read
Tian Pan
Software Engineer

Your chat product has been quietly meeting a 1.5-second p99 latency SLO for months. The request rate is flat, the prompt sizes are flat, the model has not changed. Then, on a Tuesday afternoon, p99 jumps to 4.8 seconds and stays there. The on-call investigation finds no anomaly in the chat path: same requests-per-minute, same median prompt of around 800 tokens, same retry behavior in the SDK. The chat service's deploy log is empty for the day. The breach lasts six hours.

The cause is in another team's repo. That morning, a long-document summarization feature shipped on the same organization key, with average prompts of 12,000 tokens. Their request rate is modest, a few hundred per minute, but each call burns through the shared tokens-per-minute budget fifteen times faster than yours. The provider's throttle fires on the chat path because the chat path draws from the same bucket the summarization team just emptied. Nobody changed your code, nobody breached anyone's planned capacity, and your SLO is now a function of a workload your team has never looked at.

This is what shared token-per-minute (TPM) limits do to multi-product organizations. The provider gives you one denomination of capacity. Your teams plan in another. The accounting gap is invisible until a new workload tilts the ratio.

The Unit Your Provider Throttles In Is Not the Unit You Planned In

The major providers enforce tokens-per-minute limits above the level of the individual API key. OpenAI and Anthropic apply them per organization, and Google's Gemini API applies them per project, so every key under the same organization (or project) draws from the same TPM pool. OpenAI documents this directly: all keys created under the same organization share the same RPM, TPM, RPD, and TPD pools. Anthropic enforces rate limits per organization rather than per key, using a token-bucket algorithm that refills continuously up to the org ceiling.
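To make the shared-pool behavior concrete, here is a minimal sketch of a continuously refilling token bucket. It is a generic textbook token bucket, not any provider's actual implementation; the `TokenBucket` class, the 60,000 TPM ceiling, and the call sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """One org-wide bucket. Capacity and refill rate both come from the TPM limit."""
    tpm_limit: float      # hypothetical org ceiling, in tokens per minute
    level: float = 0.0    # tokens currently available
    last: float = 0.0     # time of the last refill, in seconds

    def __post_init__(self) -> None:
        self.level = self.tpm_limit  # start with a full bucket

    def try_consume(self, tokens: float, now: float) -> bool:
        # Continuous refill: credit tokens for the elapsed time, capped at the ceiling.
        elapsed = now - self.last
        self.level = min(self.tpm_limit, self.level + elapsed * self.tpm_limit / 60.0)
        self.last = now
        if tokens <= self.level:
            self.level -= tokens
            return True
        return False  # a real provider would answer with a 429 here


# Hypothetical 60,000 TPM ceiling, i.e. 1,000 tokens refilled per second.
bucket = TokenBucket(tpm_limit=60_000)

# Five 12,000-token summarization calls arrive at t=0 on one API key...
summarize_results = [bucket.try_consume(12_000, now=0.0) for _ in range(5)]

# ...then an 800-token chat call arrives on a different key, same bucket.
chat_at_0s = bucket.try_consume(800, now=0.0)
chat_at_1s = bucket.try_consume(800, now=1.0)

print(summarize_results, chat_at_0s, chat_at_1s)
```

The bucket has no notion of which key or team a request belongs to. In this sketch the chat call is rejected only because it arrived after the bucket was drained, and it succeeds a second later once enough tokens have refilled.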

The number you negotiate with the provider, or the tier you graduate into by spend, is denominated in tokens per minute. That TPM figure is the real budget. Your application's capacity plan is usually written in something else.

Most application teams reason about capacity in two units the provider does not throttle in: requests per minute and dollars per month. RPM is what your load balancer and autoscaler care about. Dollars is what your finance partner cares about. Neither of those is the unit the throttle fires on. A team that holds an RPM budget on a TPM-throttled API has built a capacity plan in the wrong denomination, and the moment another team's prompt sizes change, the plan misses by the ratio of the prompt size shift.

The math is not subtle. If your chat workload runs at 1,000 RPM with 800-token average prompts, you consume 800K TPM. If a sibling workload runs at 500 RPM with 12,000-token average prompts, it consumes 6M TPM, so the org now draws 6.8M TPM and chat accounts for under 12% of it. The chat team's request rate has not changed, but the chat team now has far less TPM headroom than it did yesterday. When the org bucket runs dry, the provider rejects whichever request arrives next, regardless of which team sent it, and SDK retries with exponential backoff push the chat path's p99 into the seconds without any signal that the chat path itself is misbehaving.
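The arithmetic above can be reproduced in a few lines. This is a sketch that counts prompt tokens only, as the figures in the text do; the helper name `tpm` is made up for illustration, and some providers also count output tokens against the limit.

```python
def tpm(requests_per_minute: int, avg_prompt_tokens: int) -> int:
    """Tokens per minute consumed by a workload (prompt tokens only)."""
    return requests_per_minute * avg_prompt_tokens


chat_tpm = tpm(1_000, 800)              # 800,000
summarization_tpm = tpm(500, 12_000)    # 6,000,000
combined_tpm = chat_tpm + summarization_tpm

per_call_ratio = 12_000 / 800           # each summarization call costs 15x a chat call
chat_share = chat_tpm / combined_tpm    # chat's share of org consumption

print(f"chat: {chat_tpm:,} TPM")
print(f"summarization: {summarization_tpm:,} TPM")
print(f"combined: {combined_tpm:,} TPM, chat share {chat_share:.1%}")
```

The summarization workload sends half as many requests as chat but consumes 7.5 times as many tokens, which is why an RPM dashboard gives no warning.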

Why the On-Call Investigation Goes Nowhere

The first thing the chat team's on-call does is look at the chat service. The dashboards are clean. The request count is normal. The prompt size distribution is normal. The retries dashboard shows a spike, but the cause of the retries is upstream: the provider is returning 429s, and the 429s are caused by traffic the chat team has no visibility into.
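A rough sketch of why a retry spike turns into a latency breach: with exponential backoff, each 429 adds a wait that doubles. The 0.5-second base delay and the factor of 2 below are assumptions for illustration, not any particular SDK's defaults, and real clients usually add jitter and may honor a retry-after header.

```python
def added_latency(retries: int, base_delay: float = 0.5, factor: float = 2.0) -> float:
    """Total seconds spent waiting across `retries` backoff sleeps, without jitter."""
    return sum((base_delay * factor ** attempt for attempt in range(retries)), 0.0)


# Hypothetical request that normally completes in 1.0 second.
normal_latency = 1.0
for retries in range(4):
    total = normal_latency + added_latency(retries)
    print(f"{retries} throttled attempts -> {total:.1f}s end to end")
```

Under these assumed settings, a request that is throttled three times spends 3.5 seconds sleeping before its successful attempt even starts, and none of that time shows up as slowness inside the chat service itself.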

Cross-team rate-limit blame is structurally hard because the failure is one step removed from the team that owns the SLO. The chat team owns p99 latency. The summarization team owns its own request rate, its own prompt size distribution, and its own bill. Both teams stay within their own internal budgets. The aggregate consumption breaches the provider's TPM ceiling. There is no single team whose individual plan was wrong.

This is the failure mode where everybody is internally consistent and the system is globally broken. Each team's capacity planning checks out against the budget that team negotiated. The composition does not, because no team's budget was denominated in the same unit the provider's throttle is denominated in. The chat team budgeted in RPM. The summarization team budgeted in dollars and RPM. The provider throttles in TPM. Three units, one bucket, no owner.
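The composition problem can be shown as a check that passes per team and fails for the org. The per-team RPM budgets and the 5M TPM org ceiling below are hypothetical numbers; only the request rates and prompt sizes come from the example earlier in the text.

```python
ORG_TPM_CEILING = 5_000_000  # hypothetical provider limit for the whole org

teams = {
    # rpm_budget values are hypothetical internal budgets, denominated in RPM.
    "chat": {"rpm": 1_000, "rpm_budget": 1_200, "avg_prompt_tokens": 800},
    "summarization": {"rpm": 500, "rpm_budget": 600, "avg_prompt_tokens": 12_000},
}

# Each team checks itself in the unit it planned in.
within_own_budget = {name: t["rpm"] <= t["rpm_budget"] for name, t in teams.items()}

# The provider checks the sum, in the unit it throttles in.
tpm_by_team = {name: t["rpm"] * t["avg_prompt_tokens"] for name, t in teams.items()}
aggregate_tpm = sum(tpm_by_team.values())
org_within_ceiling = aggregate_tpm <= ORG_TPM_CEILING

print(within_own_budget)
print(f"aggregate {aggregate_tpm:,} TPM, within ceiling: {org_within_ceiling}")
```

Every per-team check is green while the org-level check is red, and no existing dashboard in this setup owns the second check.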

The reflex fix — bump the provider tier — buys time but does not fix the structure. The next prompt-size shift, the next new workload, the next product launch on the same org key reproduces the same pattern at a higher absolute budget. The unit mismatch is the bug. The tier ceiling is the symptom buffer.
