Skip to main content

720 posts tagged with "llm"

View all tags

The Token Forecast That Mistook a Holiday Trough for the New Baseline

· 10 min read
Tian Pan
Software Engineer

A capacity planner walks into the quarterly budget review with a token forecast built from a clean trailing four-week window. Three of those four weeks happened to span a regional holiday. Daily active sessions were down 40% across that span. The forecast lands 35% under what Q+1 actually consumes, the rate-limit dashboard flatlines red on day one of the new quarter, and the postmortem finds the model behaved exactly as specified — it averaged the most recent four weeks of demand and projected forward. The model was not wrong. The window was.

This is not a story about a bad forecaster. It is a story about treating LLM token spend as if it were the same shape as the EC2 bill it shares a cost center with. The EC2 bill is governed by infrastructure decisions you control: provisioned instances, reserved capacity, scaling policies that respond to load. The token bill is governed by users who decided to take a long weekend. The first is engineering output. The second is consumer demand. A planner who confuses the two will keep building forecasts on windows the calendar guarantees are non-stationary.

The Finish Reason Your Code Never Inspects

· 10 min read
Tian Pan
Software Engineer

Your handler did everything right. The HTTP status was 200. The body parsed. The text field had characters in it. You incremented responses_succeeded, appended the message to the conversation, returned the JSON down to the client, and moved on. The user got a sentence that ended mid-clause, a redacted answer dressed up as a normal one, or a polite refusal phrased as a completion. Your dashboard does not know any of that happened. The provider told you. You did not read the field.

Every major inference API returns a stop signal alongside the text: OpenAI calls it finish_reason, Anthropic calls it stop_reason, Gemini calls it finishReason. The field is small. It is one enum value per response. It is also the only out-of-band channel the model has for telling you whether the response you just shipped is the answer or a fragment of one. Treating it as cosmetic is the same shape of bug as ignoring HTTP status codes — except your monitoring caught the HTTP one a decade ago and has no opinion about this one.

The Library Version Your Coding Agent Remembers Wrong

· 10 min read
Tian Pan
Software Engineer

The diff looks clean. The agent imported the right module, called what looks like the right function, and TypeScript stayed quiet. The PR description even cites the docs. Then the build runs in CI and the call explodes with TypeError: x is not a function — because the function was split into two in a minor bump eight months ago, and the agent generated against the version of the library that existed inside its training data, not the version installed in your package.json.

This is not the kind of failure the "LLMs hallucinate" frame prepares you for. The model isn't inventing an API that never existed. It's remembering an API that existed once and doesn't anymore. The mental model the agent is reasoning from is a snapshot frozen at training time. The world has moved on. The codebase has moved on. And the agent has no idea, because nobody told it.

The Prompt Cache Your Personalization Layer Quietly Killed

· 11 min read
Tian Pan
Software Engineer

The product team ships personalization. The agent now greets the user by name, tunes its response length to their stated preference, knows the user works in healthcare, and respects the user's timezone for any date it mentions. The satisfaction lift is real and measurable — the A/B is a four-point win on thumbs-up rate and the rollout goes to one hundred percent. Three weeks later, finance flags that inference spend has roughly tripled, and nobody on the AI team can immediately explain why.

The explanation is one line of code change buried in the system-prompt builder. Per-user context — name, preferred response length, industry, timezone — got prepended to the system prompt so the model would see it on every turn. That made every user's prompt unique from the first token. Your provider's prompt cache, which had been serving roughly ninety percent of your input tokens at one-tenth the standard price, stopped hitting. Latency barely moved, so the perf dashboard stayed green. The billing dashboard caught up at month-end.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.

The Trace That Stops at the Provider Boundary

· 11 min read
Tian Pan
Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive minute of your pipeline — measured in dollars, in p95, in user-visible lag — lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The next turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans nine orders of magnitude in cost.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Planner%20That%20Treated%20Every%20Tool%20as%20O(1%29)

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it on prompt. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.