720 posts tagged with "llm"

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.
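One way to keep a reconnect from becoming a second billed request is to key the work on a request id that the client reuses across reconnects. The sketch below is a minimal illustration under that assumption, not the fix the post itself describes. `RequestRegistry`, `get_or_run`, and `fake_model` are hypothetical names, and a real version would also have to track requests that are still in flight when the connection drops.

```python
from typing import Callable, Dict


class RequestRegistry:
    """Remembers completions by client-supplied request id (hypothetical design)."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._completed: Dict[str, str] = {}

    def get_or_run(self, request_id: str, prompt: str) -> str:
        # A reconnect that reuses the same request id is served from memory
        # instead of triggering a second, separately billed model call.
        if request_id not in self._completed:
            self._completed[request_id] = self._call_model(prompt)
        return self._completed[request_id]


billed_calls = []


def fake_model(prompt: str) -> str:
    # Stand-in for the provider call; each invocation represents one billed request.
    billed_calls.append(prompt)
    return "answer"


registry = RequestRegistry(fake_model)
first = registry.get_or_run("req-1", "long prompt")
second = registry.get_or_run("req-1", "long prompt")  # simulated reconnect
```

With this in place the simulated reconnect returns the stored answer, and `billed_calls` records a single provider call.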

The Summarizer That Paraphrased Away the User's Literal Question

· 8 min read
Tian Pan
Software Engineer

A user asks: "Does this qualify as a 'transfer' under article 28?" Forty turns later, the model gives an answer to a different question. The transcript shows the model answered the question it was given. To the user, the answer reads like a hallucination. Both are right. The model never saw the user's question. It saw your summarizer's polite translation of it: "user asked about article 28 applicability."

The word "transfer" was the question. The summarizer threw it away because the summarizer's loss function was tuned to preserve facts, not wording, and the rubric never learned the difference between paraphrasing the topic and paraphrasing the constraint. Topic was preserved. Constraint became fog.

This failure mode is structural, not anecdotal. Any application that compresses long conversations with a model-generated summary has a second model in the critical path — one whose quality contract is usually treated as a token-budget knob rather than as a piece of product logic. That asymmetry is where the bug lives.
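A cheap guard is to treat anything the user put in quotation marks as a literal constraint and check that the summary still contains it. The sketch below assumes quoted spans are a reasonable proxy for "wording that matters", which will not hold for every product. `quoted_terms` and `dropped_terms` are hypothetical helpers, and the regex is deliberately naive: an apostrophe inside a word such as "doesn't" will confuse it.

```python
import re
from typing import List

# Matches text between straight or curly quotes (single or double).
QUOTED = re.compile(r"[\"'“‘]([^\"'“”‘’]+)[\"'”’]")


def quoted_terms(text: str) -> List[str]:
    """Return the spans the user wrapped in quotation marks."""
    return [span.strip() for span in QUOTED.findall(text)]


def dropped_terms(user_message: str, summary: str) -> List[str]:
    """Quoted terms from the user message that the summary no longer contains."""
    lowered = summary.lower()
    return [t for t in quoted_terms(user_message) if t.lower() not in lowered]


question = "Does this qualify as a 'transfer' under article 28?"
summary = "user asked about article 28 applicability."
missing = dropped_terms(question, summary)
```

Here `missing` holds the word the summarizer threw away, which is enough of a signal to reject the summary or append the literal question to it.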

The Token Count Your Client Estimated And Your Provider Invoiced

· 12 min read
Tian Pan
Software Engineer

Your application counted tokens locally with a tokenizer library matching what you believed the provider used. The SDK reported "estimated 4,200 tokens" before each call. Your budget logic admitted the request. Then the provider's invoice came back at 6,800 tokens for the same payload. Multiply that roughly 60% gap by a few million calls a month and the line item your finance team cannot reconcile against your own logs starts to look like an architectural mistake rather than a rounding error.

The mistake is not that the local tokenizer was wrong. The mistake is treating the local tokenizer as a contract instead of a guess. Tokenization is something the provider does inside their serving stack — your library is a model of that process, not the process itself, and the two drift in ways that are small per call and structural across the population of calls you actually make.
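Treating the estimate as a guess can be as simple as scaling it by the drift you have actually observed before admitting a request. The sketch below uses the 4,200 and 6,800 figures from above. The `drift_ratio` and `admit` helpers and the 5,000-token budget are illustrative assumptions, and in practice the ratio would come from billed usage aggregated over many calls rather than one.

```python
def drift_ratio(estimated_tokens: int, invoiced_tokens: int) -> float:
    """How much larger the provider's count was than the local estimate."""
    return invoiced_tokens / estimated_tokens


def admit(estimated_tokens: int, budget_tokens: int, observed_ratio: float) -> bool:
    """Admit a request only if the drift-adjusted estimate fits the budget."""
    return estimated_tokens * observed_ratio <= budget_tokens


ratio = drift_ratio(4200, 6800)          # about 1.62
naive_ok = admit(4200, 5000, 1.0)        # trusts the local tokenizer as a contract
adjusted_ok = admit(4200, 5000, ratio)   # treats it as a guess with known drift
```

The naive check admits the request and the adjusted one rejects it, which is the gap the invoice later exposes.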

Your Latency SLO Is a Function of Other Teams' Prompt Sizes

· 10 min read
Tian Pan
Software Engineer

Your chat product has been running quietly at a 1.5-second p99 latency SLO for months. The request rate is flat, the prompt sizes are flat, the model has not changed. Then, on a Tuesday afternoon, p99 jumps to 4.8 seconds and stays there. The on-call investigation finds no anomaly in the chat path: same requests-per-minute, same median prompt of around 800 tokens, same retry behavior on the SDK. The deploy log for the chat service is empty for the day. The breach lasts six hours.

The cause is in another team's repo. That morning, a long-document summarization feature shipped on the same organization key, with average prompts of 12,000 tokens. Their request rate is modest — a few hundred per minute — but each call burns through the shared tokens-per-minute budget fifteen times faster than yours. The provider's throttle fires on the chat path because the chat path was holding the same bucket the summarization team just emptied. Nobody changed your code, nobody breached anyone's planned capacity, and your SLO is now a function of a workload your team has never read.
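The arithmetic is worth writing down. The prompt sizes below come from the scenario above. The request rates and the shared tokens-per-minute limit are hypothetical numbers chosen only to show how a modest request rate can empty a shared bucket.

```python
def tokens_per_minute(requests_per_minute: int, avg_prompt_tokens: int) -> int:
    """Input-token load one workload places on a shared per-minute budget."""
    return requests_per_minute * avg_prompt_tokens


CHAT_PROMPT_TOKENS = 800        # from the scenario
SUMMARY_PROMPT_TOKENS = 12_000  # from the scenario

# Hypothetical values: neither the rates nor the limit are given in the text.
CHAT_RPM = 2_000
SUMMARY_RPM = 300
SHARED_TPM_LIMIT = 4_000_000

per_call_ratio = SUMMARY_PROMPT_TOKENS / CHAT_PROMPT_TOKENS
chat_load = tokens_per_minute(CHAT_RPM, CHAT_PROMPT_TOKENS)
summary_load = tokens_per_minute(SUMMARY_RPM, SUMMARY_PROMPT_TOKENS)

chat_alone_fits = chat_load <= SHARED_TPM_LIMIT
combined_fits = chat_load + summary_load <= SHARED_TPM_LIMIT
```

Under these assumed numbers the chat workload fits comfortably on its own, while the combined load exceeds the limit even though the summarization feature sends far fewer requests.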

The Cache Stampede That Hit Your Model Provider Instead of Your Database

· 10 min read
Tian Pan
Software Engineer

The pager went off at 14:02 UTC. Not for latency, not for errors — for spend. The cost dashboard showed a vertical line: three minutes of input-token billing at roughly nine times the trailing hourly average, then back to normal. No regression had shipped. No tenant had onboarded. Traffic was flat to the minute. The only thing that changed is that a single prompt prefix — the 14K-token system message that every agent in the fleet shared — had quietly expired on the provider side, and a thousand workers had all decided, within the same 200ms window, that they were the ones who needed to write it back.

This is a cache stampede. It is the same bug operators have been writing post-mortems about since memcached shipped in 2003. What is new in 2026 is that the cache it stampedes is no longer yours. It lives inside your model provider, you cannot inspect its state, and every miss costs real money instead of a few extra database queries. The synchronization bug that database engineers learned to jitter away two decades ago has quietly reappeared on a bill line item nobody thought to defend.
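The classic in-process defense is single-flight: the first caller for a key does the expensive work and everyone who arrives meanwhile waits for that result. The sketch below is a minimal threaded version, not a fleet-wide fix. `SingleFlight` is a hypothetical class, it only coordinates workers inside one process, and if the leader's function raises, waiting callers receive `None`, so production code would need to propagate the error.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, Dict[str, Any]] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None}
                self._inflight[key] = entry

        if leader:
            try:
                # Only the leader pays for the cache write.
                entry["value"] = fn()
            finally:
                entry["done"].set()
                with self._lock:
                    self._inflight.pop(key, None)
            return entry["value"]

        # Followers block until the leader finishes, then reuse its result.
        entry["done"].wait()
        return entry["value"]
```

Across many machines the same idea needs a shared lock or a designated warmer, plus jitter on refresh times so the fleet does not wake up in the same window.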

The Deterministic Seed Your Provider Treated as a Hint, Not a Contract

· 10 min read
Tian Pan
Software Engineer

The CI test was a single assertion: same model, same temperature, same prompt, same seed, same output string. It passed on every developer's laptop, passed on the first hundred CI runs, and then flaked once every fifty runs for three weeks before anyone admitted the pattern was real. The first hypothesis was the obvious one — a non-deterministic dependency somewhere in the test harness — and three days of investigation found nothing. The actual cause was sitting in a footnote on the provider's API reference: "seed provides best-effort determinism." The team had read the parameter name and assumed a contract. The provider had documented a hint.

This is a specific failure mode of hosted inference that catches teams who design test infrastructure around a single mental model: the model is a pure function of its inputs, and the seed is what makes the function reproducible. Both halves of that model are wrong in production, and the gap between the API surface and the underlying physics is wide enough that teams build entire eval and regression-test stacks on top of an assumption their provider explicitly disclaimed.
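If the seed is a hint, a test that asserts exact equality on one run is measuring luck. One alternative, sketched below under the assumption that you can afford repeated calls, is to measure how often repeated runs agree and assert on that rate. `agreement_rate` is a hypothetical helper and the sample outputs are made up to mirror the one-in-fifty flake described above.

```python
from collections import Counter
from typing import List


def agreement_rate(outputs: List[str]) -> float:
    """Fraction of runs that produced the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Fifty runs with the same model, prompt, temperature, and seed; one differs.
runs = ["expected output"] * 49 + ["expected  output"]
rate = agreement_rate(runs)
```

A regression test can then require the rate to stay above a chosen floor instead of demanding a guarantee the provider never offered.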

The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they screenshotted from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the mean as its forecasting input has already lost the argument with the bill.
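A toy population shows why the tail hides from the usual percentiles. The numbers below are illustrative assumptions, not data from the post: 990 users sending about 60 tokens and 10 users pasting roughly 7,500 tokens (thirty kilobytes at an assumed four bytes per token). `percentile` is a hypothetical nearest-rank helper.

```python
import math
from statistics import mean, median
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical production sample: short prompts plus a thin, heavy tail.
prompt_tokens = [60] * 990 + [7500] * 10

typical = median(prompt_tokens)
average = mean(prompt_tokens)
p99 = percentile(prompt_tokens, 99)
p99_5 = percentile(prompt_tokens, 99.5)
```

In this sample the median and even the p99 still look like the pilot, the mean is already more than double the median, and the tail only appears at p99.5.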

The Latency Budget Your Orchestrator Spent on Its Own Planning Step

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a week-long instrumentation pass on a customer-support agent that had, on paper, a perfectly reasonable median latency. P50 was inside SLO, P95 was uncomfortable but explainable, and the tool-call traces looked healthy. Then someone bucketed the spans by type and the room got quiet. The agent was spending roughly 58% of its wall-clock per run inside spans labeled "plan," "reflect," "decide-next-step," and "self-check." Tool execution — the database lookups, the CRM writes, the auth checks — accounted for under 30%. The thing the agent was being measured on did less than the thing nobody was measuring.

That ratio is not a fluke. It is the natural state of any plan-act-observe loop that you do not actively police. The orchestrator is paid in latency for thinking and paid in latency for acting, and the thinking step is almost always cheaper to add than the acting step, so it grows unchecked. By the time you notice, "decide what to do next" has become its own line item — bigger than most of the line items you originally built the agent to serve.
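The bucketing step itself is small. The sketch below groups span durations into "planning" and everything else, using the span labels quoted above. The `planning_share` helper, the tuple shape of a span, and the sample durations are assumptions, since the original trace format is not shown.

```python
from typing import Iterable, Tuple

# Labels named in the text as orchestrator overhead.
PLANNING_LABELS = {"plan", "reflect", "decide-next-step", "self-check"}


def planning_share(spans: Iterable[Tuple[str, float]]) -> float:
    """Fraction of total span time spent in planning-type spans."""
    total = 0.0
    planning = 0.0
    for label, duration_ms in spans:
        total += duration_ms
        if label in PLANNING_LABELS:
            planning += duration_ms
    if total == 0:
        raise ValueError("no span time recorded")
    return planning / total


# Hypothetical single run, durations in milliseconds.
run = [
    ("plan", 3000), ("db-lookup", 1200), ("reflect", 1000),
    ("crm-write", 1100), ("decide-next-step", 1000),
    ("auth-check", 500), ("self-check", 800), ("render", 1400),
]
share = planning_share(run)
```

Tracking this share per run makes the planning overhead something you can put a budget on.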

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

The Retention Policy That Erased Context Your Model Was Still Reading

· 12 min read
Tian Pan
Software Engineer

A nightly retention worker deletes any user message older than thirty days. A long-running enterprise support session, opened in early March, is still active in late May. On the request that comes in at turn 41, your prompt assembler reads from the same messages table the retention worker has been quietly pruning. Turns 1 through 28 are gone. The model receives a conversation that starts at turn 29 with no signal that earlier turns ever existed. The user asks "what was the SLA we agreed on earlier?" and the model confidently invents a number, because the actual answer was in turn 4 — which the retention worker erased the night before.

This is not a model failure. The model did exactly what it was supposed to: produce a plausible answer from the context it was handed. The failure happened upstream, in the gap between two teams that each thought they owned the messages table.
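Whoever ends up owning the table, the prompt assembler can at least refuse to hide the gap. The sketch below assumes each stored message carries its turn number, which the post does not confirm, and inserts an explicit marker wherever turns are missing. `assemble_context` and the marker wording are hypothetical.

```python
from typing import List, Tuple


def assemble_context(turns: List[Tuple[int, str]]) -> List[str]:
    """Build prompt lines from (turn_number, text) rows, flagging missing turns."""
    lines: List[str] = []
    expected = 1
    for number, text in sorted(turns):
        if number > expected:
            # Tell the model that history existed here but is no longer stored.
            lines.append(f"[turns {expected}-{number - 1} are unavailable]")
        lines.append(f"turn {number}: {text}")
        expected = number + 1
    return lines


# Turns 1 through 28 were pruned; only 29 through 41 remain in storage.
surviving = [(n, f"message {n}") for n in range(29, 42)]
context = assemble_context(surviving)
```

With the marker in place the model has a reason to say it cannot see the earlier agreement instead of inventing one.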

The Self-Correction Loop That Shared Its Verifier's Blind Spot

· 10 min read
Tian Pan
Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.
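Two of those three failures can be made explicit in a few lines: a hard iteration budget, and a plateau check that escalates when the score stops improving. The sketch below is one possible shape, not the team's implementation. `refine`, `patience`, and `min_gain` are hypothetical, and it does nothing about the third failure, which needs a verifier that is independent of the generator.

```python
from typing import Callable, Tuple


def refine(
    generate: Callable[[int], str],
    verify: Callable[[str], float],
    threshold: float,
    max_iters: int,
    patience: int,
    min_gain: float,
) -> Tuple[str, int]:
    """Return ("accepted" | "escalate", iterations_used)."""
    best = float("-inf")
    stalled = 0
    for i in range(1, max_iters + 1):
        score = verify(generate(i))
        if score >= threshold:
            return "accepted", i
        if score >= best + min_gain:
            best, stalled = score, 0
        else:
            # No meaningful improvement over the best score seen so far.
            stalled += 1
            if stalled >= patience:
                return "escalate", i
    return "escalate", max_iters


# Scores that hover between 0.78 and 0.84, as in the trace described above.
scores = [0.78, 0.84, 0.80, 0.82, 0.79, 0.83, 0.81, 0.84, 0.78, 0.82, 0.80, 0.83]
outcome = refine(
    generate=lambda i: str(i),
    verify=lambda draft: scores[int(draft) - 1],
    threshold=0.90, max_iters=12, patience=2, min_gain=0.01,
)
```

On that score sequence the loop escalates at the fourth iteration instead of running until the job times out.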

The Structured Output Schema Two Models Interpret Differently

· 9 min read
Tian Pan
Software Engineer

The first time your fallback route fires in production is the wrong time to discover that your two providers do not agree on what your schema means. The JSON Schema looks identical in both client configurations. The validator passes on both outputs. The downstream code reads the field by name and gets a value. And then a billing total comes out as a string of digits instead of an integer, or a list of length one arrives as a bare object instead of a single-element array, and a code path that has been green for six months silently returns the wrong answer.

The seductive thing about structured output is that it removes a class of bugs — unparseable JSON, hallucinated fields, missing keys — and so it feels like it removes the parsing problem entirely. What it actually does is move the parsing problem one layer up, from the lexer to the type system, where it is much harder to see. Two providers can both honor a JSON Schema and still produce outputs that are not interchangeable, because "honor" has at least four distinct meanings in this corner of the ecosystem and your schema does not specify which one you wanted.
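One defensive pattern is a normalization step at the provider boundary that both repairs the two shape differences described above and records that it had to. The sketch below assumes a payload with a `total` field and an `items` field, names invented for illustration. `normalize_invoice` is a hypothetical helper, not part of any provider SDK.

```python
from typing import Any, Dict, List, Tuple


def normalize_invoice(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return a copy with canonical types, plus a log of coercions applied."""
    fixed = dict(payload)
    coercions: List[str] = []

    total = fixed.get("total")
    if isinstance(total, str) and total.isascii() and total.isdigit():
        fixed["total"] = int(total)
        coercions.append("total: str -> int")
    elif type(total) is not int:  # also rejects bool, which subclasses int
        raise TypeError(f"total has unsupported type {type(total).__name__}")

    items = fixed.get("items")
    if isinstance(items, dict):
        fixed["items"] = [items]
        coercions.append("items: object -> array")
    elif not isinstance(items, list):
        raise TypeError(f"items has unsupported type {type(items).__name__}")

    return fixed, coercions


primary = {"total": 1200, "items": [{"sku": "A"}]}
fallback = {"total": "1200", "items": {"sku": "A"}}

clean_primary, primary_log = normalize_invoice(primary)
clean_fallback, fallback_log = normalize_invoice(fallback)
```

The coercion log is the useful part: a non-empty log on the fallback route tells you the two providers disagree before a billing total does.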