Skip to main content

The Token Budget That Ran Out Mid-Conversation: Why Free-Tier Users Think Your Model Got Dumber

· 12 min read
Tian Pan
Software Engineer

A product manager I know spent two weeks triaging a churn spike on her company's AI writing assistant. Free-tier session length had collapsed by 30%, the support inbox filled up with variations of "your model used to be smart, now it's lazy," and the team's first instinct was to blame a model upgrade that had shipped the same week. The model had not changed. What had changed was that finance had quietly tightened the per-user token budget mid-quarter, and the app had been silently truncating system prompts, dropping tool calls, and shortening responses for any user who crossed the new threshold. From the user's seat, the AI had degraded. From the dashboard, nothing was wrong. Both were true, and that is the failure mode.

This pattern is everywhere now. ChatGPT's free tier drops to a smaller model when the limit is hit, with no in-product label other than "responses may be shorter for a while." Anthropic's free tier behaves similarly. Build a feature on top of either, layer on your own per-user budget for cost control, and you have stacked two invisible cliffs in series — the platform's and yours — and the user, who only sees one chat box, has no way to tell which one they just walked off.

The fix is not bigger budgets. The fix is treating the budget itself as a first-class part of the user interface — visible, predictable, and fair — so the cap stops being read as a quality regression and starts being read as what it actually is: a pricing signal the user can act on.

Why "Silent Degradation" Looks Like a Dumber Model

Users do not have a mental model for token economics. They have a mental model for product quality. When the same prompt that yielded a four-paragraph answer last Tuesday yields a two-sentence answer this Thursday, the only explanation available to them is that something about the AI got worse. They do not consider that the system prompt was trimmed to save tokens, that the retrieval step that grounded last week's answer was skipped today because the budget ran out before it ran, that the tool call that would have looked up live data was suppressed, or that the model itself was swapped for a smaller sibling halfway through their session.

This is not a hypothetical perception gap. A long-running Reddit thread about OpenAI silently halving effective context windows hit nearly 2,000 upvotes in 2025, and a recurring forum genre — "is ChatGPT getting dumber?" — has run continuously since at least early 2024. Most of these threads are not about model regressions at all. They are about the gap between what the marketing page advertised (a context window of N tokens) and what the runtime actually allocated to that specific session (some fraction of N, decided by load, plan tier, conversation length, and silent fallbacks). The model is doing what it was told to do. The user is reacting to what they were told they would get.

A useful frame: every "the AI got dumber" complaint is a leaked implementation detail. The user is reporting, in non-technical language, that they hit a budget boundary they did not know existed. If your product never told them the boundary was there, never showed them where they were against it, and never gave them a way to act on it, the only narrative available to them is regression.

The Anatomy of a Mid-Conversation Cliff

A typical AI feature does not have one cap. It has several, stacked, each invisible.

The system prompt budget. Modern agent stacks consume 30 to 60 percent of the context window before the user's first message even lands — system instructions, persona, safety guardrails, tool schemas (roughly 200 tokens per moderately documented tool, so five tools is a thousand-token tax on every call), retrieved RAG chunks, conversation history. Under cost pressure, the easiest thing to trim is the part the user cannot see. So the system prompt gets compressed, the persona gets ablated, the safety boilerplate gets shortened, and the model behaves measurably differently — for reasons that are entirely invisible to the person on the other end of the chat.

The conversation-history budget. When the running history exceeds its allocation, something has to give. Naïve implementations drop the oldest turns first, which means the user's stated preferences from earlier in the session quietly vanish; the agent forgets that they asked for short answers, that they are a beginner, that they corrected a misunderstanding twenty minutes ago. The user now experiences an AI that "doesn't listen," when in fact it has been instructed to forget.

The tool-call budget. Tool calls cost real money — each invocation runs another inference, often with the full conversation re-encoded. Cost-conscious gateways cap them per user per hour. When the cap is reached, the agent silently degrades to no-tool mode, which means the live-data path stops working and the model falls back to its training-data answer, often subtly wrong, never labeled. The user sees an answer that looks confident and is stale.

The response-length budget. Output tokens cost more than input tokens, often three to five times more, so under budget pressure the response gets shorter. Three paragraphs become one. The bullet list gets truncated. Streaming gets cut off with a "truncated" suffix the UI hides. The product feels rushed; the user feels rushed.

The model-tier budget. The most consequential: when the user crosses their personal limit, the gateway swaps the frontier model for a cheaper one mid-session. The conversation continues. The branding stays the same. The answers get noticeably worse in ways that are hard to articulate but easy to feel. ChatGPT's documented behavior — fall back to a mini model on free and Plus when the cap is hit — is the canonical example; every team building on top inherits the pattern.

None of these are bugs. All of them are reasonable engineering decisions in isolation. Together, they form a graded degradation curve that the user perceives as a single, mysterious, escalating quality regression. Each layer is also a separate billing surface, which is why finance ends up tightening them independently and at different times, with the result that no single team has the full picture of how the product feels at the budget edge.

The Cliff Versus the Slope

There are essentially two design philosophies for handling a user crossing the budget boundary.

The first is the cliff. The user gets the full product until they don't; at the threshold, a hard paywall or rate-limit error appears. This is easy to implement and brutal to receive. It also has the perverse property of being more legible than the alternative — at least the user knows what happened — but the legibility comes with maximum friction at exactly the moment the user was deepest in their task. Cliffs convert a fraction of users, infuriate the rest, and leak a steady stream of churn from people who hit the wall once and remember it.

The second is the slope, which is what most AI products have drifted toward. Quality degrades gracefully across many small dimensions: shorter outputs, dropped tools, swapped models, trimmed memory. There is no single moment the user can point to. The product just feels worse over time. This avoids the cliff's friction but creates a worse problem — the user attributes the degradation to the product itself rather than to a pricing signal they can act on. Slopes lose the upgrade conversation entirely, because the user never realizes one was on offer.

The third path, which the better-designed products are converging on, is the honest slope: graceful degradation paired with explicit, contextual visibility. The user still gets a softer landing than the cliff, but they are told, at the moment of degradation, that they hit a cap, what they got instead, and what restoring full quality would require. The degradation becomes a feature, not a flaw, because the user is now in the loop.

This third pattern matters because the upgrade conversation is where conversion happens. The data on freemium SaaS is consistent: contextual upgrade prompts tied to a specific capability boundary convert several times higher than generic ones, and the window of opportunity is short — most conversions happen within forty-eight hours of the triggering moment. Silent degradation throws that window away. The user feels frustration, attributes it to the product, and the product never gets the chance to reframe the frustration as a choice.

Designing for Budget Visibility

The translation work here is unglamorous but high-leverage. A few patterns that work in practice.

Surface the quota as a live signal, not an after-the-fact error. A small indicator that shows "responses today: 14 of 20" or "premium tools available for 8 more messages" turns an invisible constraint into a navigable one. The user can pace themselves, save the deep query for when it counts, and understand that scarcity exists without feeling ambushed by it. Time-to-quota-exhaustion is the analytic version of this signal; the user-facing version is even more important.

Label the degradation at the moment it happens. "This response was shortened to fit your free-tier limit" is a sentence the user can act on. "Here is your response" while the response is actually a degraded one is a sentence that breaks trust. The cost of the label is two lines of UI. The cost of not labeling it is the user concluding that your product is worse than it was yesterday.

Make the upgrade prompt match the boundary the user just crossed. A generic "upgrade to Pro" banner converts poorly. An upgrade prompt that says "to keep using the full model on conversations longer than 20 turns" converts dramatically better, because it is selling exactly the capability the user just lost. The contextual specificity is the conversion mechanic. Each budget boundary in your system is a candidate trigger; instrument them as such.

Give the user the choice the system already made for them. If the agent is about to swap models because the cheap one fits the budget, ask first when the stakes warrant it: "This question would normally use the full model — switch to the lighter one to stay within today's limit, or upgrade?" Most users will pick the lighter model and remain happy. The minority who would have upgraded now have an offer in front of them at exactly the right moment. The minority who would have churned silently are now visible to your funnel.

Treat the system-prompt and tool-call budgets as product surfaces too. If you compress the system prompt under load, log it as a product event, not just a backend metric, and surface it to product and design — not because users need to see "system prompt compressed" (they don't), but because the behavioral difference it causes is observable, and product needs to own that difference rather than discover it from a support ticket. The same goes for dropped tool calls and trimmed history: anything that changes user-visible behavior is a product event whether or not the team that ships the budget decision sees it that way.

Build the budget into the eval suite. Most evals run with an unconstrained context and unlimited tool calls. The version of your product that real free-tier users experience is the version that has been quietly compressed and degraded. Run a tier of evals against the degraded configuration. If the degraded version fails an eval the full version passes, you have measured the size of the experience gap your free tier is shipping — and that number is the size of the conversion pressure (or the churn risk) you are sitting on.

What Engineering Owns and What Product Owns

The conversation that gets stuck on most teams is whose problem this is. Engineering owns the budget enforcement: the rate limiters, the gateway logic, the fallback chains, the per-tenant quotas. Product owns the user's experience of the budget. The trap is treating these as separate workstreams. They are the same workstream, and the failure mode is what happens when they are not.

A useful internal rule: any budget decision that changes user-visible behavior — model swap, tool drop, response truncation, history compression — must ship with a UI affordance that names the change. If engineering wants to add a new budget dimension, the question "how does the user find out this happened" gets answered before the budget ships, not after the support tickets arrive. This sounds like overhead. It is the opposite of overhead. It is the only reliable way to keep "the model got dumber" out of your inbox, because the model did not get dumber. Your runtime got cheaper, and you forgot to tell anyone.

The freemium AI product is only sustainable when the gap between the free and paid experience is legible. Make the gap invisible and you lose both ends of the funnel: free users churn quietly because they think the product is bad, and paid users never appear because they were never given a reason to. Make the gap visible and the same budget cap becomes a pricing argument, a usage education, and a conversion event — three things at once, all from the same line of UI copy that engineering thought was a courtesy and product thought was a chore.

The token budget is not a backend concern. It is the product, every time it bites.

References:Let's stay in touch and Follow me for more thoughts and updates