2 posts tagged with "token-budget"

The Token Budget That Ran Out Mid-Conversation: Why Free-Tier Users Think Your Model Got Dumber

May 28, 2026 · 12 min read

Software Engineer

A product manager I know spent two weeks triaging a churn spike on her company's AI writing assistant. Free-tier session length had collapsed by 30%, the support inbox filled up with variations of "your model used to be smart, now it's lazy," and the team's first instinct was to blame a model upgrade that had shipped the same week. The model had not changed. What had changed was that finance had quietly tightened the per-user token budget mid-quarter, and the app had been silently truncating system prompts, dropping tool calls, and shortening responses for any user who crossed the new threshold. From the user's seat, the AI had degraded. From the dashboard, nothing was wrong. Both were true, and that is the failure mode.

This pattern is everywhere now. ChatGPT's free tier drops to a smaller model when the limit is hit, with no in-product label other than "responses may be shorter for a while." Anthropic's free tier behaves similarly. Build a feature on top of either, layer on your own per-user budget for cost control, and you have stacked two invisible cliffs in series: the platform's and yours. The user, who sees only one chat box, has no way to tell which one they just walked off.

Agentic Task Complexity Estimation: Budget Tokens Before You Execute

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Two agents receive the same user message. One finishes in 3 seconds and 400 tokens. The other enters a Reflexion loop, burns through 40,000 tokens, hits the context limit mid-task, and produces a half-finished answer. Neither the agent nor the calling system predicted which outcome was coming. This is not an edge case — it is the default behavior when agents start tasks without any model of how deep the work will go.

LLM-based agents have no native sense of task scope before execution. A request that reads as simple in natural language might require a dozen tool calls and multiple planning cycles; a complex-sounding request might resolve in a single lookup. Without pre-execution complexity estimation, agents commit resources blindly: each turn re-sends the accumulated history, so cumulative token spend grows roughly quadratically with turn count; planning overhead dominates execution time; and by the time the system detects a problem, the early decisions that caused it are irreversible.
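
To see why cost compounds as turn history accumulates, here is a minimal sketch. It assumes every turn re-sends the full history and that each turn adds a fixed number of tokens; both are simplifications, and the function names are hypothetical rather than part of any agent framework. It counts prompt tokens only and ignores output tokens and prompt caching.

```python
def prompt_tokens_at_turn(turn, tokens_per_turn, system_tokens=0):
    """Prompt size at a 1-indexed turn when the full history is re-sent."""
    return system_tokens + turn * tokens_per_turn


def cumulative_prompt_tokens(turns, tokens_per_turn, system_tokens=0):
    """Total prompt tokens across the run: the sum of every turn's prompt."""
    return sum(
        prompt_tokens_at_turn(t, tokens_per_turn, system_tokens)
        for t in range(1, turns + 1)
    )


def fits_budget(estimated_turns, tokens_per_turn, budget, system_tokens=0):
    """Pre-execution check: does the estimated run stay within the budget?"""
    total = cumulative_prompt_tokens(estimated_turns, tokens_per_turn, system_tokens)
    return total <= budget


if __name__ == "__main__":
    # Assumed numbers: 400 tokens added per turn, a 40,000-token budget.
    for turns in (10, 20):
        total = cumulative_prompt_tokens(turns, 400)
        print(turns, total, fits_budget(turns, 400, budget=40_000))
```

Under these assumptions, doubling the estimated turn count from 10 to 20 roughly quadruples the total prompt tokens, which is why an estimate of depth made before the first call is worth more than a limit check made after the tenth.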
