Your Latency SLO Is a Function of Other Teams' Prompt Sizes
Your chat product has been quietly meeting a 1.5-second p99 latency SLO for months. The request rate is flat, the prompt sizes are flat, the model has not changed. Then, on a Tuesday afternoon, p99 jumps to 4.8 seconds and stays there. The on-call investigation finds no anomaly in the chat path: same requests-per-minute, same median prompt of around 800 tokens, same retry behavior in the SDK. The deploy log for the chat service is empty for the day. The breach lasts six hours.
The cause is in another team's repo. That morning, a long-document summarization feature shipped on the same organization key, with average prompts of 12,000 tokens. Their request rate is modest, a few hundred per minute, but each of their calls burns through the shared tokens-per-minute budget fifteen times faster than one of yours. The provider's throttle fires on the chat path because the chat path draws from the same bucket the summarization team just emptied. Nobody changed your code, nobody exceeded their own planned capacity, and your SLO is now a function of a workload your team has never seen.
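To make the arithmetic concrete, here is a rough sketch of the shared budget before and after the summarization launch. The prompt sizes come from the incident above. Everything else is an assumption made up for illustration: the chat rate of 1,000 requests per minute, the summarization rate of 300 per minute (one reading of "a few hundred"), the 4,000,000 tokens-per-minute limit, and the `Workload` and `budget_report` names. It also counts prompt tokens only and ignores completions, so treat it as a back-of-envelope model, not how any particular provider meters usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    requests_per_minute: int
    avg_prompt_tokens: int

    @property
    def tokens_per_minute(self) -> int:
        # Prompt tokens only; completion tokens are ignored in this sketch.
        return self.requests_per_minute * self.avg_prompt_tokens


def budget_report(tpm_limit: int, workloads: list[Workload]) -> dict:
    """Summarize demand against a shared tokens-per-minute limit."""
    demand = sum(w.tokens_per_minute for w in workloads)
    return {
        "demand": demand,
        "utilization": demand / tpm_limit,
        "share": {w.name: w.tokens_per_minute / demand for w in workloads},
    }


# Prompt sizes are from the incident; request rates and the limit are hypothetical.
chat = Workload("chat", requests_per_minute=1_000, avg_prompt_tokens=800)
summarize = Workload("summarize", requests_per_minute=300, avg_prompt_tokens=12_000)
TPM_LIMIT = 4_000_000  # hypothetical organization-wide limit

before = budget_report(TPM_LIMIT, [chat])
after = budget_report(TPM_LIMIT, [chat, summarize])

print(f"per-call cost ratio: {summarize.avg_prompt_tokens / chat.avg_prompt_tokens:.0f}x")
print(f"utilization before:  {before['utilization']:.0%}")
print(f"utilization after:   {after['utilization']:.0%}")
print(f"summarize share:     {after['share']['summarize']:.0%}")
```

Under these made-up numbers, chat alone uses a fifth of the budget, and a workload sending less than a third as many requests pushes the organization past its limit. The chat team's own inputs to the calculation never change, which is why nothing shows up on their dashboards.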
