Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI
Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.
The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.
The Hidden Cost Explosion Inside a Conversation
The compounding token cost of multi-turn conversations is the root cause of the mismatch, and it's more severe than most teams expect.
Turn 1 of a conversation costs some baseline number of tokens — say, 300 input tokens for the user's question plus 200 output tokens for the model's reply. Turn 2 now includes the full history: the user's first message, the model's first reply, and the new user message. By turn 10, the input to the model includes 9 prior exchanges; with those numbers, the turn-10 request alone carries roughly 4,800 input tokens, about 16x the turn-1 input. The user has the subjective experience of "having one conversation," but the API sees token consumption that looks like multiple heavy workloads arriving simultaneously.
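The arithmetic is easy to verify. A minimal sketch, assuming the flat 300-input/200-output per-turn figures above and no prompt caching:

```python
def input_tokens_per_turn(turns: int, user_tokens: int = 300, reply_tokens: int = 200) -> list[int]:
    """Input tokens the model receives at each turn when the full history is resent.

    Assumes flat per-turn sizes (the 300/200 figures above) and no prompt caching.
    """
    per_exchange = user_tokens + reply_tokens
    # Turn n carries (n - 1) prior exchanges plus the new user message.
    return [user_tokens + (n - 1) * per_exchange for n in range(1, turns + 1)]


costs = input_tokens_per_turn(10)
print(costs[0], costs[-1])   # 300 ... 4800: the turn-10 request is 16x the turn-1 input
print(sum(costs))            # 25500 input tokens billed across the whole session
```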
This compounds further in agentic contexts. When a user asks a copilot to "refactor this file," the agent internally reads the file, plans edits, applies them, re-reads to verify, handles tool results, and loops again. What surfaces as one user interaction translates to 10–20 individual model calls. Each of those calls draws from the same rate-limit bucket. A user with normal usage patterns hits a 429 mid-task and has no idea why — they didn't send 47 requests, they sent one.
The GitHub Copilot Agent Mode situation is a clean example. Agent Mode's agentic loops consume session budgets at 10–20x the rate of regular chat because every loop iteration is a discrete API call. Users who had been operating fine in conversational mode found themselves throttled immediately when switching to Agent Mode for the first time. Their usage hadn't spiked in any subjectively meaningful sense, but the infrastructure disagreed.
What "Session Budget" Actually Means
The fix isn't to raise raw limits — it's to change what you're measuring. Instead of counting individual requests or tokens as an undifferentiated stream, you attach resource consumption to a session and treat the session as the unit of throttling.
A session has a unique ID that persists for the duration of a conversation. All API calls originating from that session draw from the same pool. When the pool is exhausted, the session is limited — not the user's next independent conversation. Fresh sessions get fresh budgets. The effect is that a user who runs three short independent conversations in a minute doesn't suffer for a 30-turn debugging thread they had last hour.
The dual-window structure matters here. A per-minute token budget controls interactive latency — it's how you protect the model serving layer from instant spikes. A per-hour or per-day budget controls overall cost. Session tracking applies to both. When Anthropic excludes cached input tokens from the per-minute token count, this isn't an accident; it's an acknowledgment that reused context represents a fundamentally different computational load and shouldn't count the same as fresh inference.
The session ID also creates useful behavioral incentives. Users who close a conversation and open a new one to reset rate limits find that the new session gets a fresh budget, but the allowance it draws on is tied to their account, not to the window instance. You can't farm budgets by opening five chat tabs. GitHub Copilot tracks sessions at the account level rather than per chat window, which makes this design choice explicit.
Implementation is straightforward: the client generates a UUID when a conversation starts, includes it as a header on every request (X-Session-ID), and the server maintains a map of session ID → budget consumed. Session entries expire after inactivity. The complexity is in the dual-window sliding counter, not the session tracking itself.
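A minimal sketch of that bookkeeping, assuming a single-process in-memory store (a real deployment would use Redis or similar) and hypothetical budget numbers:

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical budgets; real values depend on provider tier and cost targets.
MINUTE_BUDGET_TOKENS = 20_000
HOUR_BUDGET_TOKENS = 200_000
SESSION_IDLE_TTL_S = 30 * 60


@dataclass
class SessionBudget:
    # (timestamp, tokens) events; pruned once they fall outside the widest window.
    events: list[tuple[float, int]] = field(default_factory=list)
    last_seen: float = field(default_factory=time.time)

    def _spent_since(self, cutoff: float) -> int:
        return sum(tok for ts, tok in self.events if ts >= cutoff)

    def try_consume(self, tokens: int) -> bool:
        now = time.time()
        self.last_seen = now
        self.events = [(ts, tok) for ts, tok in self.events if ts >= now - 3600]
        # Both windows must have headroom: per-minute for latency, per-hour for cost.
        if self._spent_since(now - 60) + tokens > MINUTE_BUDGET_TOKENS:
            return False
        if self._spent_since(now - 3600) + tokens > HOUR_BUDGET_TOKENS:
            return False
        self.events.append((now, tokens))
        return True


# Session ID (from the X-Session-ID header) -> budget state.
sessions: defaultdict[str, SessionBudget] = defaultdict(SessionBudget)


def allow_request(session_id: str, estimated_tokens: int) -> bool:
    # Lazily expire sessions that have gone idle.
    now = time.time()
    for sid in [s for s, b in sessions.items() if now - b.last_seen > SESSION_IDLE_TTL_S]:
        del sessions[sid]
    return sessions[session_id].try_consume(estimated_tokens)
```

If cached input tokens should count differently, as in the Anthropic example above, the caller simply passes only uncached tokens into try_consume.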
Retry Storms and Semantic Deduplication
When conversations get throttled, clients retry. That's expected. The problem is when retries are poorly coordinated.
Synchronized retries are a classic failure amplifier. If 10,000 clients hit a rate limit at the same time and all apply a 2-second fixed backoff, 10,000 requests arrive simultaneously at second 2, likely causing the same failure again.
The retry interval needs jitter — a randomized component that spreads requests across time. With a 10% jitter window, one client retries at 1.83 seconds, another at 2.17, another at 1.94. The server sees a smooth recovery ramp instead of a synchronized hammer.
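A sketch of the client-side retry loop, using the same plus-or-minus 10% jitter window as the example; the exception class is a placeholder for whatever your HTTP client raises on a 429:

```python
import random
import time


class RateLimitedError(Exception):
    """Placeholder for whatever your HTTP client raises on a 429."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, jitter: float = 0.10) -> float:
    """Exponential backoff with a +/-10% jitter window, as described above."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1 - jitter, 1 + jitter)


def call_with_retries(send, max_attempts: int = 5):
    """`send` is any zero-argument callable wrapping the API request (hypothetical)."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```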
But the real problem for conversational AI isn't coordinated machine retries; it's coordinated human retries. A user whose turn 10 fails in a multi-step reasoning task clicks "regenerate." The client may simultaneously retry under the hood. Two identical requests are now in-flight. If both succeed and the model's output is non-deterministic, the user sees inconsistent behavior. If both hit rate limits again, the user's budget is charged twice for nothing.
Idempotency keys solve the duplicate problem at the infrastructure level. The client includes a key derived from the session ID, turn number, and a hash of the input. The server checks whether it's already processing or has already processed a request with that key. If yes, it returns the cached response. This prevents duplicate charges and eliminates the race condition between human and machine retries.
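A sketch of the key derivation and server-side lookup; the storage and function names are illustrative, not any provider's API, and a real deployment would keep the cache in Redis with a TTL:

```python
import hashlib
import time


def idempotency_key(session_id: str, turn: int, user_input: str) -> str:
    """Stable key derived from the session ID, turn number, and a hash of the input."""
    digest = hashlib.sha256(user_input.encode("utf-8")).hexdigest()[:16]
    return f"{session_id}:{turn}:{digest}"


# Key -> (completed_at, response). In-memory here for brevity.
_responses: dict[str, tuple[float, str]] = {}
_in_flight: set[str] = set()


def handle_turn(key: str, generate) -> str:
    """Run `generate` (the expensive model call) at most once per idempotency key."""
    if key in _responses:
        return _responses[key][1]       # replay the cached response, charge nothing
    if key in _in_flight:
        raise RuntimeError("duplicate request already in flight")  # or wait on the first one
    _in_flight.add(key)
    try:
        response = generate()
        _responses[key] = (time.time(), response)
        return response
    finally:
        _in_flight.discard(key)
```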
Semantic deduplication extends this to near-duplicate requests. "Can you help me debug this?" and "Please fix this issue" are meaningfully identical requests that a naive hash-based deduplication misses. Embedding the request and comparing against a short-TTL cache of recently processed requests from the same session lets you collapse these into one. For retry storms specifically, the window is short — 30 to 60 seconds — but it's sufficient to absorb the human-retry pattern.
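A sketch of that cache, assuming you already have an embedding function for incoming requests; the similarity threshold and TTL are placeholder values, not tuned numbers:

```python
import math
import time

DEDUP_TTL_S = 60              # the short retry-storm window described above
SIMILARITY_THRESHOLD = 0.92   # hypothetical; tune against real traffic


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Session ID -> list of (timestamp, embedding, cached response).
_recent: dict[str, list[tuple[float, list[float], str]]] = {}


def dedup_lookup(session_id: str, embedding: list[float]) -> str | None:
    """Return a recent response from the same session if the new request is near-identical."""
    now = time.time()
    live = [e for e in _recent.get(session_id, []) if now - e[0] < DEDUP_TTL_S]
    _recent[session_id] = live
    for _, emb, resp in live:
        if cosine(embedding, emb) >= SIMILARITY_THRESHOLD:
            return resp
    return None


def dedup_store(session_id: str, embedding: list[float], response: str) -> None:
    _recent.setdefault(session_id, []).append((time.time(), embedding, response))
```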
Graceful Degradation Is a Design Choice
The standard failure mode for a rate-limited multi-turn conversation is an abrupt 429 error at turn 10 of a 15-turn reasoning task. The user loses all context accumulated in those 10 turns. They must start over, consume budget on the restart, and get a worse result because the context is gone. The infrastructure "protected itself" by delivering a user experience significantly worse than having no AI feature at all.
Graceful degradation is the alternative. It requires treating the rate limit not as a binary wall but as a signal to degrade quality rather than availability.
The simplest implementation is proactive signaling. When the session's remaining token budget drops below a threshold — say, 20% of the hourly budget — the server includes a response header indicating the session is approaching limits. The client can switch from a large model to a smaller one, reduce the context window by summarizing earlier turns, or warn the user that the current conversation is approaching its natural limit. Any of these is better than an opaque 429 mid-response.
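A sketch of what that signaling could look like on the server side; the header names and threshold are hypothetical, not an existing provider's API:

```python
WARN_FRACTION = 0.20   # warn when less than 20% of the hourly budget remains


def budget_headers(spent_this_hour: int, hourly_budget: int) -> dict[str, str]:
    """Response headers a server could attach so clients degrade before a hard 429."""
    remaining = max(hourly_budget - spent_this_hour, 0)
    headers = {
        "X-Session-Budget-Limit": str(hourly_budget),
        "X-Session-Budget-Remaining": str(remaining),
    }
    if remaining < hourly_budget * WARN_FRACTION:
        headers["X-Session-Budget-Warning"] = "approaching-limit"
    return headers
```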
Model downgrading in response to rate-limit pressure is underused. Switching from a frontier model to a smaller one reduces per-turn token cost, which buys additional turns without hitting the budget wall. The user notices a mild quality reduction in later turns, which is generally acceptable — especially for the kind of wrap-up or summarization tasks that tend to come at the end of a long session.
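On the client, the warning signal from the previous sketch can drive the model choice directly; the thresholds and tier names below are placeholders, not real model identifiers:

```python
def pick_model(budget_fraction_remaining: float) -> str:
    """Trade per-turn cost for extra turns as the session budget runs down."""
    if budget_fraction_remaining > 0.5:
        return "frontier-large"
    if budget_fraction_remaining > 0.2:
        return "mid-tier"
    return "small-fast"
```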
Context compaction is the other lever. A 20-turn conversation where turns 1–15 are still in the full context window is paying 60–70% of its token budget on history, not new reasoning. Summarizing turns 1–10 into a compact summary and dropping the raw messages can cut context cost by half or more. Server-side compaction makes this invisible to the user. Client-side compaction, where the UI shows "earlier turns summarized," is honest and often acceptable.
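A sketch of compaction, assuming you already have a summarization call and a tokenizer; both are passed in as hypothetical callables:

```python
def compact_history(
    messages: list[dict],
    summarize,            # hypothetical: callable turning a list of messages into a short summary string
    count_tokens,         # hypothetical: callable returning the token count of a message list
    keep_last: int = 6,
    max_history_tokens: int = 3_000,
) -> list[dict]:
    """Replace older turns with a summary once the history grows past a token threshold."""
    if len(messages) <= keep_last or count_tokens(messages) <= max_history_tokens:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": "Summary of earlier turns: " + summarize(older)}
    return [summary, *recent]
```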
Circuit breakers apply at a different layer. When a provider is returning elevated error rates — even from non-rate-limit causes — a circuit breaker trips and routes traffic to a fallback provider. This isn't just about rate limits; it's about the broader reliability guarantee that conversational features require. A session interrupted by an outage is similar in user impact to a session cut off by rate limiting. The right failure response is the same: preserve continuity, degrade quality, don't terminate abruptly.
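A minimal circuit-breaker sketch; the provider callables, failure threshold, and cooldown are placeholders:

```python
import time


class CircuitBreaker:
    """Trip to a fallback after repeated failures; probe the primary again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_primary(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open after the cooldown: let one request through to probe recovery.
        return time.time() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()


def call_with_fallback(breaker: CircuitBreaker, primary, fallback):
    """`primary` and `fallback` are callables wrapping two providers (hypothetical)."""
    if breaker.allow_primary():
        try:
            result = primary()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback()
```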
The Interface Your API Owes the Client
Most LLM APIs today expose enough information for a well-engineered client to handle these situations, but they don't make it easy. The rate-limit response headers are there, but they're per-minute counters, not session budgets. The caller has to infer session state from raw aggregate numbers.
What the interface should expose (a header-level sketch follows the list):
- Session budget remaining (not just account-level TPM) — so clients can make per-conversation decisions without implementing their own accounting
- An "approaching limit" signal before the hard limit hits — so clients can degrade proactively rather than react to failures
- Distinction between cached and non-cached token consumption — because these represent different computational loads and should be budgeted separately
- Conversation-scoped idempotency — so retries within a session can be collapsed at the infrastructure level without client-side implementation
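Put together, the response surface could look something like this; every name below is hypothetical, not an existing provider header:

```python
# Hypothetical response headers for a conversation-aware API.
# None of these exist in current provider APIs; they sketch the list above.
example_headers = {
    "X-Session-Budget-Limit": "200000",                # session budget, not account-level TPM
    "X-Session-Budget-Remaining": "31500",
    "X-Session-Budget-Warning": "approaching-limit",   # proactive signal before the hard limit
    "X-Tokens-Cached": "4100",                         # cached vs. fresh tokens, budgeted separately
    "X-Tokens-Fresh": "900",
    "Idempotency-Key": "9f3a0c2e:12:ab94c2d1",         # conversation-scoped deduplication key
}
```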
Some of these are beginning to appear as platform features. Session budget tracking and per-conversation observability are increasingly available through LLM gateway products. Native API support is still sparse.
Why This Is an Architecture Problem, Not a Tuning Problem
The temptation when conversations start hitting rate limits is to raise the numbers — get a higher tier, negotiate a higher TPM cap. That fixes the immediate symptom for current load but not the underlying design mismatch.
Per-request rate limiting applied to conversational workloads will always be inaccurate because the unit of measure is wrong. The resource being consumed is the session, and the metric being measured is the request. If you build features around multi-turn conversations and throttle them as if each turn is an independent transaction, you'll keep hitting the same failure mode at every new scale.
The underlying primitives — session budgets, retry deduplication, graceful degradation, proactive signaling — aren't complex. They're standard distributed systems patterns applied to a domain where they're still underused. The infrastructure work to implement them is well-defined. The organizational work is simpler: stop designing your rate limiting strategy around a model that assumed stateless requests, because the features you're shipping aren't stateless.
Multi-turn AI features are one product. Their throttling strategy needs to reflect that.
Sources
- https://aionx.co/ai-comparisons/ai-chatbot-rate-limits-compared/
- https://community.openai.com/t/how-to-handle-rate-limits-when-building-a-chatbot-with-openai-api/1357992
- https://platform.claude.com/docs/en/api/rate-limits
- https://developers.openai.com/api/docs/guides/rate-limits
- https://github.com/orgs/community/discussions/193263
- https://www.theregister.com/2026/04/15/github_copilot_rate_limiting_bug/
- https://portkey.ai/blog/rate-limiting-for-llm-applications/
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
- https://api7.ai/blog/rate-limiting-guide-algorithms-best-practices
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://markaicode.com/implement-graceful-degradation-llm-frameworks/
- https://www.vellum.ai/blog/how-to-manage-openai-rate-limits-as-you-scale-your-app
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- https://docs.github.com/en/copilot/concepts/usage-limits
