Skip to main content

702 posts tagged with "llm"

View all tags

The Deterministic Seed Your Provider Treated as a Hint, Not a Contract

· 10 min read
Tian Pan
Software Engineer

The CI test was a single assertion: same model, same temperature, same prompt, same seed, same output string. It passed on every developer's laptop, passed on the first hundred CI runs, and then flaked once every fifty runs for three weeks before anyone admitted the pattern was real. The first hypothesis was the obvious one — a non-deterministic dependency somewhere in the test harness — and three days of investigation found nothing. The actual cause was sitting in a footnote on the provider's API reference: "seed provides best-effort determinism." The team had read the parameter name and assumed a contract. The provider had documented a hint.

This is a specific failure mode of hosted inference that catches teams who design test infrastructure around a single mental model: the model is a pure function of its inputs, and the seed is what makes the function reproducible. Both halves of that model are wrong in production, and the gap between the API surface and the underlying physics is wide enough that teams build entire eval and regression-test stacks on top of an assumption their provider explicitly disclaimed.

The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they screenshotted from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the mean as its forecasting input has already lost the argument with the bill.

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

The Retention Policy That Erased Context Your Model Was Still Reading

· 12 min read
Tian Pan
Software Engineer

A nightly retention worker deletes any user message older than thirty days. A long-running enterprise support session, opened in early March, is still active in late May. On the request that comes in at turn 41, your prompt assembler reads from the same messages table the retention worker has been quietly pruning. Turns 1 through 28 are gone. The model receives a conversation that starts at turn 29 with no signal that earlier turns ever existed. The user asks "what was the SLA we agreed on earlier?" and the model confidently invents a number, because the actual answer was in turn 4 — which the retention worker erased the night before.

This is not a model failure. The model did exactly what it was supposed to: produce a plausible answer from the context it was handed. The failure happened upstream, in the gap between two teams that each thought they owned the messages table.

The Self-Correction Loop That Shared Its Verifier's Blind Spot

· 10 min read
Tian Pan
Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.

The Structured Output Schema Two Models Interpret Differently

· 9 min read
Tian Pan
Software Engineer

The first time your fallback route fires in production is the wrong time to discover that your two providers do not agree on what your schema means. The JSON Schema looks identical in both client configurations. The validator passes on both outputs. The downstream code reads the field by name and gets a value. And then a billing total comes out as a string of digits instead of an integer, or a list of length one arrives as a bare object instead of a single-element array, and a code path that has been green for six months silently returns the wrong answer.

The seductive thing about structured output is that it removes a class of bugs — unparseable JSON, hallucinated fields, missing keys — and so it feels like it removes the parsing problem entirely. What it actually does is move the parsing problem one layer up, from the lexer to the type system, where it is much harder to see. Two providers can both honor a JSON Schema and still produce outputs that are not interchangeable, because "honor" has at least four distinct meanings in this corner of the ecosystem and your schema does not specify which one you wanted.

The Token Forecast That Mistook a Holiday Trough for the New Baseline

· 10 min read
Tian Pan
Software Engineer

A capacity planner walks into the quarterly budget review with a token forecast built from a clean trailing four-week window. Three of those four weeks happened to span a regional holiday. Daily active sessions were down 40% across that span. The forecast lands 35% under what Q+1 actually consumes, the rate-limit dashboard flatlines red on day one of the new quarter, and the postmortem finds the model behaved exactly as specified — it averaged the most recent four weeks of demand and projected forward. The model was not wrong. The window was.

This is not a story about a bad forecaster. It is a story about treating LLM token spend as if it were the same shape as the EC2 bill it shares a cost center with. The EC2 bill is governed by infrastructure decisions you control: provisioned instances, reserved capacity, scaling policies that respond to load. The token bill is governed by users who decided to take a long weekend. The first is engineering output. The second is consumer demand. A planner who confuses the two will keep building forecasts on windows the calendar guarantees are non-stationary.

The Finish Reason Your Code Never Inspects

· 10 min read
Tian Pan
Software Engineer

Your handler did everything right. The HTTP status was 200. The body parsed. The text field had characters in it. You incremented responses_succeeded, appended the message to the conversation, returned the JSON down to the client, and moved on. The user got a sentence that ended mid-clause, a redacted answer dressed up as a normal one, or a polite refusal phrased as a completion. Your dashboard does not know any of that happened. The provider told you. You did not read the field.

Every major inference API returns a stop signal alongside the text: OpenAI calls it finish_reason, Anthropic calls it stop_reason, Gemini calls it finishReason. The field is small. It is one enum value per response. It is also the only out-of-band channel the model has for telling you whether the response you just shipped is the answer or a fragment of one. Treating it as cosmetic is the same shape of bug as ignoring HTTP status codes — except your monitoring caught the HTTP one a decade ago and has no opinion about this one.

The Library Version Your Coding Agent Remembers Wrong

· 10 min read
Tian Pan
Software Engineer

The diff looks clean. The agent imported the right module, called what looks like the right function, and TypeScript stayed quiet. The PR description even cites the docs. Then the build runs in CI and the call explodes with TypeError: x is not a function — because the function was split into two in a minor bump eight months ago, and the agent generated against the version of the library that existed inside its training data, not the version installed in your package.json.

This is not the kind of failure the "LLMs hallucinate" frame prepares you for. The model isn't inventing an API that never existed. It's remembering an API that existed once and doesn't anymore. The mental model the agent is reasoning from is a snapshot frozen at training time. The world has moved on. The codebase has moved on. And the agent has no idea, because nobody told it.

The Prompt Cache Your Personalization Layer Quietly Killed

· 11 min read
Tian Pan
Software Engineer

The product team ships personalization. The agent now greets the user by name, tunes its response length to their stated preference, knows the user works in healthcare, and respects the user's timezone for any date it mentions. The satisfaction lift is real and measurable — the A/B is a four-point win on thumbs-up rate and the rollout goes to one hundred percent. Three weeks later, finance flags that inference spend has roughly tripled, and nobody on the AI team can immediately explain why.

The explanation is one line of code change buried in the system-prompt builder. Per-user context — name, preferred response length, industry, timezone — got prepended to the system prompt so the model would see it on every turn. That made every user's prompt unique from the first token. Your provider's prompt cache, which had been serving roughly ninety percent of your input tokens at one-tenth the standard price, stopped hitting. Latency barely moved, so the perf dashboard stayed green. The billing dashboard caught up at month-end.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.