330 posts tagged with "observability"

The OpenTelemetry Tail Sampler That Dropped Exactly the LLM Spans Your Post-Mortem Needed

· 11 min read
Tian Pan
Software Engineer

A user pings support: "the assistant told me to cancel my service to update my address, that's insane." Your team opens the incident, asks for the conversation ID, drops it into the tracing UI, and gets a polite "no spans found for this trace." The 24-hour retention window closed an hour ago. The tail sampler decided this conversation was a routine success because the response was a syntactically valid JSON object, returned with a 200, in 1.4 seconds. By every signal your collector understood, nothing happened.

The model returned a sentence that destroyed a customer relationship, and your observability pipeline classified it as uneventful. This is not a bug in the sampler. The sampler did exactly what you configured it to do. The problem is that the policy you wrote was designed for a request-response world where "success" and "worth keeping" were close enough to be the same thing, and you ported it unmodified into a system where they are not.
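
To make the mismatch concrete, here is a minimal sketch of the kind of policy described above, next to one hypothetical alternative. The `TraceSummary` fields, the thresholds, and `llm_aware_keep` are illustrative assumptions, not the OpenTelemetry Collector's configuration or API.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    http_status: int
    latency_s: float
    valid_json: bool
    has_llm_span: bool = False  # assumed flag set by your own instrumentation


def legacy_keep(trace: TraceSummary, slow_s: float = 5.0) -> bool:
    """Request-response policy: keep only errors, slow calls, and malformed output."""
    return trace.http_status >= 500 or trace.latency_s > slow_s or not trace.valid_json


def llm_aware_keep(trace: TraceSummary, slow_s: float = 5.0) -> bool:
    """Hypothetical variant: a 'successful' LLM span is not automatically droppable."""
    return legacy_keep(trace, slow_s) or trace.has_llm_span


# The conversation from the incident: valid JSON, HTTP 200, 1.4 seconds.
incident = TraceSummary(http_status=200, latency_s=1.4, valid_json=True, has_llm_span=True)
```

Under these assumptions `legacy_keep(incident)` is false and `llm_aware_keep(incident)` is true. Keeping every LLM trace may be too expensive in practice; the sketch only shows that the keep decision needs a signal other than status and latency.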

The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

· 10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.
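
One way to catch this before the readout is to check arm shares while the experiment is running. The sketch below is an assumption about how such a check might look; `arm_shares`, `starved_arms`, and the 20% floor are hypothetical, not part of any experiment platform.

```python
from collections import Counter
from typing import Dict, Iterable, List


def arm_shares(assignments: Iterable[str]) -> Dict[str, float]:
    """Fraction of traffic that landed in each arm."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {arm: n / total for arm, n in counts.items()}


def starved_arms(assignments: Iterable[str], expected_arms: List[str],
                 min_share: float = 0.2) -> List[str]:
    """Arms whose traffic share is below the floor, including arms never picked."""
    shares = arm_shares(assignments)
    return sorted(arm for arm in expected_arms if shares.get(arm, 0.0) < min_share)
```

Fed a log in which the model picked "thorough" 91% of the time, `starved_arms` returns the other two personas, which is the alert the pipeline never raised.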

The Presigned URL That Expired Before Your User Could Verify the Multimodal Model's Claim

· 10 min read
Tian Pan
Software Engineer

A user opens yesterday's conversation. Next to their support agent's reply sits a broken-image placeholder where their uploaded receipt used to be. The reply confidently quotes "the charge of $47.32 dated March 14 at the merchant Coffee Tribunal." The user has no way to check whether that quote is accurate, because the evidence the model worked from is now a 403 from your object store. They file a hallucination ticket. Your eval suite did not catch it because the model was, at the time of the call, exactly right.

This is a story about retention mismatch, not about model quality. Your transcript outlived its grounding. The grounding was a presigned URL with a fifteen-minute clock, and the claim about the grounding is text that will sit in your database for years. When the asset clock and the claim clock run at different speeds, every correctly-grounded multimodal answer eventually looks like fabrication to whoever revisits it.
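
One possible way to align the two clocks is to persist a durable object key next to the claim and mint a fresh short-lived URL at render time. This is a sketch under stated assumptions: `GroundedClaim`, `evidence_url`, and the `sign` callable are hypothetical names, and `sign` stands in for whatever presigning call your object store provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GroundedClaim:
    text: str        # what the model said; retained as long as the transcript
    object_key: str  # durable storage key for the evidence, not a signed URL


def evidence_url(claim: GroundedClaim, sign: Callable[[str, int], str],
                 ttl_s: int = 900) -> str:
    """Mint a fresh short-lived URL each time the transcript is rendered."""
    return sign(claim.object_key, ttl_s)
```

This only helps if the underlying object is retained for as long as the transcript that quotes it; re-signing a deleted object still fails.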

The Provider Auto-Router That Quietly Routed Your Premium Traffic To Haiku

· 10 min read
Tian Pan
Software Engineer

Your platform team adopted the provider's "auto" model identifier for cost reasons. The first dashboard after rollout was hard to argue with: a 34% spend reduction with no measurable quality drop on the weekly eval. Two quarters later, customer satisfaction on your shortest, highest-volume surface had been sliding since the rollout, and a product-led investigation eventually traced the regression to a model identifier nobody on the engineering team had touched. The code said "auto." The provider had been redefining what "auto" meant the whole time.

The lesson is not that auto-routing is bad. The lesson is that "auto" is a moving target whose distribution drifts with provider economics, and your eval's representativeness is the only check standing between vendor optimization and your product quality. If the eval does not match the traffic, the discount you celebrated is being paid out of a quality slope nobody is reviewing.
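
A rough way to measure whether the eval matches the traffic is to compare the mix of models that "auto" actually resolved to in each. The sketch assumes you can log a resolved model name per response; `model_mix` and `total_variation` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, Iterable


def model_mix(resolved_models: Iterable[str]) -> Dict[str, float]:
    """Share of calls served by each resolved model."""
    counts = Counter(resolved_models)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: n / total for model, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0.0 means identical mixes, 1.0 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Tracking this distance per product surface over time would surface the drift even when the weekly eval score stays flat.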

The Rate-Limit Headers Your Provider Returned That Disagreed With The Actual Throttle

· 10 min read
Tian Pan
Software Engineer

The response header said you had 480,000 tokens-per-minute of headroom. The 429 arrived after you spent 240,000. Your scheduler had been autoscaling against a number the runtime was never going to honor, and the burndown chart on the wall was reading the documentation while the throttler was enforcing something else entirely.

This is one of those failures that takes a long time to even notice, because every component along the path is doing exactly what it advertised. The provider returns a header. Your client parses it. Your scheduler reads it. Your dashboard plots it. None of these layers is broken. What is broken is the assumption that the header is a contract.
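
If the header is a hint rather than a contract, the scheduler can cap it with what the throttle has actually enforced. The following is a minimal sketch; `EffectiveLimit`, its method names, and the 90% safety margin are assumptions for illustration.

```python
from typing import Optional


class EffectiveLimit:
    """Track the lowest per-window spend at which a 429 was actually observed."""

    def __init__(self, safety_pct: int = 90) -> None:
        self.safety_pct = safety_pct
        self.observed_ceiling: Optional[int] = None

    def record_429(self, spent_in_window: int) -> None:
        # Remember the most pessimistic evidence seen so far.
        if self.observed_ceiling is None or spent_in_window < self.observed_ceiling:
            self.observed_ceiling = spent_in_window

    def budget(self, header_remaining: int, spent_in_window: int) -> int:
        """Tokens the scheduler may still spend in the current window."""
        if self.observed_ceiling is None:
            return header_remaining  # no contrary evidence yet; trust the header
        enforced = self.observed_ceiling * self.safety_pct // 100 - spent_in_window
        return max(0, min(header_remaining, enforced))
```

With the numbers above, a 429 recorded at 240,000 tokens caps the budget well below the 480,000 the header advertises. A production version would also need to relax the ceiling over time, since the enforced limit can rise as well as fall.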

The Reserved Capacity Contract That Priced Out Your Overflow When the Provider Redefined the Bucket

· 10 min read
Tian Pan
Software Engineer

A platform team signed a multi-quarter reserved-throughput contract. Fixed per-token rate on committed capacity, a higher overage rate above the ceiling. Finance modeled the burn against six months of historical traffic that rarely crested the limit. The contract said "overflow" meant tokens-per-minute above the committed ceiling, and on that definition the deal was sound.

Six weeks later the bill was up 2.4× with no change to traffic shape, no change to routing config, no change to product surface. The provider had quietly revised the metering definition mid-quarter. "Overflow" now also counted any request the auto-router sent to a model tier above the one the reservation was anchored to — so a single Sonnet selection on a complex prompt landed in the overage bucket even when aggregate throughput sat comfortably inside the committed envelope. Thirty percent of traffic that used to invoice at the reserved rate now invoiced at the overage rate. Finance chased the spike through dashboards for three weeks before someone read the mid-quarter pricing addendum and found the redefinition in a footnote.

The contract had not been broken. The unit it was denominated in had been redenominated.
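
The arithmetic behind the jump can be sketched as a blended invoice in which only the classification of traffic changes. The function name and the rates used below are hypothetical; the contract's actual rates are not given here.

```python
def invoice(total_tokens: int, overage_share: float,
            reserved_rate: float, overage_rate: float) -> float:
    """Cost of a fixed volume when a given share is metered as overage."""
    if not 0.0 <= overage_share <= 1.0:
        raise ValueError("overage_share must be between 0 and 1")
    overage_tokens = total_tokens * overage_share
    reserved_tokens = total_tokens - overage_tokens
    return reserved_tokens * reserved_rate + overage_tokens * overage_rate
```

With a hypothetical overage rate three times the reserved rate and no overage beforehand, moving 30% of unchanged traffic into the overage bucket raises the invoice by a factor of 1.6. The traffic did not move; the bucket boundary did.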

The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.
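
The two sides of the knob can be written down directly. This sketch assumes every attempt fails independently with the same probability, which real degradations rarely satisfy, so treat it as a lower bound on the distortion rather than a model of any provider.

```python
def post_retry_success(p_fail: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts attempts succeeds."""
    return 1.0 - p_fail ** max_attempts


def expected_attempts(p_fail: float, max_attempts: int) -> float:
    """Average billed attempts per call: attempt k+1 happens only if the first k failed."""
    return sum(p_fail ** k for k in range(max_attempts))
```

A per-attempt failure rate of 50% with five attempts still reports roughly 97% success while billing nearly two attempts per call. Plotting the first-attempt success rate next to the post-retry number puts both worlds on one dashboard.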

The Session Affinity Your Provider Load Balancer Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

Your dashboard says cache hit rate is 71%. Your finance partner is pleased. Your latency p50 is fine. Then a customer support thread arrives from a long-running agent session: turn 14 took eleven seconds to produce the first token, turn 15 took eight, turn 16 took nine. You pull the trace. Every one of those turns reports a cache_read_input_tokens value of zero. The system prompt is sixteen thousand tokens. The user thinks the agent is broken. You think your provider is broken. Neither of you is right. The aggregate hit rate is a survivorship statistic — it averages over the short conversations that hit cache trivially and quietly absorbs the long conversations that have collapsed to cold-on-first-token mid-session.

This is the failure mode that no provider postmortem will ever describe to you, because from their telemetry the system is working as designed. The load balancer is making the routing decision it was told to make. The cache is being populated and evicted on the schedule it was told to follow. The hint you passed — the prompt_cache_key, the conversation ID, the user ID, whatever string you serialized into that field — was advisory the whole time, and "advisory" means "ignored when convenient." Under load, when a scaling event happens, when an upstream pod is draining, when the affinity-aware tier is saturated, your hint quietly degrades to a uniform routing decision. The request lands on a cold pod. The KV tensors that would have served the prefix at sub-millisecond cost are sixteen feet away in a sibling rack and unreachable. Your conversation pays full-prefix cost again, and your dashboard's headline number doesn't move because two thousand other one-turn conversations hit cache fine.
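
The survivorship effect shows up as soon as the hit rate is split by turn depth instead of averaged. A minimal sketch, assuming each logged turn carries the `cache_read_input_tokens` value from the response; the `Turn` record and the depth cutoff of 10 are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    conversation_id: str
    turn_index: int
    cache_read_input_tokens: int


def hit_rate(turns: List[Turn]) -> float:
    """Fraction of turns that read anything from the prompt cache."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if t.cache_read_input_tokens > 0) / len(turns)


def hit_rate_by_depth(turns: List[Turn], deep_turn: int = 10) -> Dict[str, float]:
    """Split the headline number into shallow and deep turns."""
    shallow = [t for t in turns if t.turn_index < deep_turn]
    deep = [t for t in turns if t.turn_index >= deep_turn]
    return {"all": hit_rate(turns), "shallow": hit_rate(shallow), "deep": hit_rate(deep)}
```

A healthy "all" figure alongside a near-zero "deep" figure is the signature of the long sessions paying full-prefix cost while the short ones carry the average.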

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.
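
One way to keep a reconnect from becoming a second invoice is to key each generation by a client-supplied request ID, so a retry attaches to the existing result instead of calling the provider again. This is an in-memory sketch only; `GenerationRegistry` and `start_generation` are hypothetical names, and a real service would need shared storage and expiry.

```python
from typing import Callable, Dict


class GenerationRegistry:
    """Deduplicate provider calls by request ID."""

    def __init__(self, start_generation: Callable[[str], str]) -> None:
        self._start = start_generation  # hypothetical callable that hits the provider
        self._results: Dict[str, str] = {}
        self.provider_calls = 0

    def get_or_start(self, request_id: str, prompt: str) -> str:
        # A reconnect with the same request_id reuses the stored result.
        if request_id not in self._results:
            self.provider_calls += 1
            self._results[request_id] = self._start(prompt)
        return self._results[request_id]
```

With this in place, the client's reconnection logic can fire as often as the proxy forces it to, and the prompt is still paid for once per request ID.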

The Streaming Abort Your Provider Billed Anyway: A 14% Gap Hiding in Your Invoice

· 10 min read
Tian Pan
Software Engineer

Your finance team filed a dispute and lost. The line item is "output tokens" and it exceeds your sum-of-delivered-tokens metric by fourteen percent. The provider's support engineer closed the ticket as "expected behavior under streaming cancellation," with a link to a documentation page that says "cancellation stops billing at the last delivered token." Both sentences are true, and the gap between them is the line of code you have not written.

The contract you read says one thing. The inference scheduler does another. The mismatch is not a bug, not a billing error, and not malice — it is a layered system in which the cancellation signal travels through three boundaries (browser, edge, GPU) and the billing meter sits at the third boundary while your "stop generating" button sits at the first. Closing the gap is an engineering project with a finance owner.
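
Before closing the gap it helps to measure it per stream rather than per invoice. A minimal sketch, assuming you record both the tokens the client received and the output tokens the provider reported for each stream; `StreamRecord` and `overbilling_ratio` are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StreamRecord:
    delivered_tokens: int  # tokens the client actually received
    billed_tokens: int     # output tokens from the provider's usage report
    aborted: bool


def overbilling_ratio(records: Iterable[StreamRecord]) -> float:
    """Billed output tokens in excess of delivered tokens, as a fraction of delivered."""
    records = list(records)
    delivered = sum(r.delivered_tokens for r in records)
    billed = sum(r.billed_tokens for r in records)
    if delivered == 0:
        return 0.0
    return (billed - delivered) / delivered
```

Computing the ratio separately for aborted and completed streams shows whether the whole gap comes from cancellations, which is the claim the support ticket makes.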

The Streaming UI That Committed a Partial Answer Your Model Never Finished

· 10 min read
Tian Pan
Software Engineer

The post-mortem read like a hallucination report. A user had acted on a confidently-worded recommendation that turned out to be wrong in a way the model would not have written if it had finished — except the trace showed the model had not finished. The provider connection dropped at token 412 of an expected 800. The client's error handler logged the failure. The persisted partial message, written to the conversation history as tokens arrived, sat in the user's UI looking exactly like every other complete answer. They acted on it. Support categorized the ticket as a content-quality issue. It took two weeks to route it to the platform team.

Nothing in this chain was a model failure. The model behaved correctly for the 412 tokens it produced. The failure was that the streaming UI and the durable conversation history had quietly disagreed about what counts as a message — and during the exact failure mode that streaming was supposed to make tolerable, the disagreement became the canonical record.

This is the contract between optimistic rendering and durable storage. Most chat products inherit it from a tutorial or a framework without thinking about it as a contract at all, and the gap shows up as a tail of incidents that look like model bugs and aren't.
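
Making the contract explicit can be as small as giving every persisted message a status that only a clean finish can set to complete. The sketch below is an assumption about one such design; the names are hypothetical, and it treats a clean end of the iterator as completion, whereas a real client would also check the provider's finish signal.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    STREAMING = "streaming"
    COMPLETE = "complete"
    INTERRUPTED = "interrupted"


@dataclass
class Message:
    chunks: List[str] = field(default_factory=list)
    status: Status = Status.STREAMING

    @property
    def text(self) -> str:
        return "".join(self.chunks)


def consume(stream: Iterable[str], message: Message) -> Message:
    """Persist chunks as they arrive, but record how the stream ended."""
    try:
        for chunk in stream:
            message.chunks.append(chunk)
    except ConnectionError:
        message.status = Status.INTERRUPTED  # partial text must not render as an answer
    else:
        message.status = Status.COMPLETE
    return message
```

The UI can then render an interrupted message with a visible marker or a retry prompt, so a 412-token fragment never looks like every other complete answer.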

The Token Count Your Client Estimated And Your Provider Invoiced

· 12 min read
Tian Pan
Software Engineer

Your application counted tokens locally with a tokenizer library matching what you believed the provider used. The SDK reported "estimated 4,200 tokens" before each call. Your budget logic admitted the request. Then the provider's invoice came back at 6,800 tokens for the same payload. Multiply that gap of roughly 62% by a few million calls a month and the line item your finance team cannot reconcile against your own logs starts to look like an architectural mistake rather than a rounding error.

The mistake is not that the local tokenizer was wrong. The mistake is treating the local tokenizer as a contract instead of a guess. Tokenization is something the provider does inside their serving stack — your library is a model of that process, not the process itself, and the two drift in ways that are small per call and structural across the population of calls you actually make.
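
Treating the estimate as a guess means correcting it with what the provider actually reported. A minimal sketch, assuming you log pairs of local estimate and provider-reported usage per call; `calibration_factor` and `admit` are hypothetical helpers.

```python
from typing import Iterable, Tuple


def calibration_factor(pairs: Iterable[Tuple[int, int]]) -> float:
    """Ratio of provider-reported tokens to locally estimated tokens over past calls."""
    pairs = list(pairs)
    estimated = sum(e for e, _ in pairs)
    billed = sum(b for _, b in pairs)
    if estimated == 0:
        return 1.0  # no history yet; fall back to the raw estimate
    return billed / estimated


def admit(estimated_tokens: int, budget_tokens: int, factor: float) -> bool:
    """Budget check against the calibrated estimate, not the raw one."""
    return estimated_tokens * factor <= budget_tokens
```

With the figures above, a 4,200-token estimate calibrated against a 6,800-token invoice no longer fits a 5,000-token budget. Because the drift is structural across payload types, a factor per route or per content type is likely more useful than one global number.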