
177 posts tagged with "observability"


LLM Model Routing Is Market Segmentation Disguised As A Cost Optimization

· 10 min read
Tian Pan
Software Engineer

The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.

What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why — because internally, the change was filed as a finops win, not a product release.

Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am

· 12 min read
Tian Pan
Software Engineer

A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.

This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.

Prompt Cache Thrashing: When Your Largest Tenant's Launch Triples Everyone's Bill

· 10 min read
Tian Pan
Software Engineer

The bill arrives on the first of the month and it is three times what your spreadsheet said it would be. Nobody pushed a system prompt change. The dashboard says request volume is flat. p95 latency looks normal. The token-per-correct-task ratio is unchanged. And yet you owe the inference vendor an extra forty thousand dollars, and the only signal in the observability stack that even hints at why is a metric most teams never alarm on: cache hit rate, which dropped from 71% to 18% somewhere in the second week of the billing cycle, on a Tuesday, at 9:47 AM Pacific, which is when your largest tenant's customer-success team kicked off a coordinated onboarding push for two hundred new users.

Welcome to prompt cache thrashing — the multi-tenant failure mode that the SaaS playbook was supposed to have eliminated a decade ago, reintroduced through the back door by your inference provider's shared prefix cache. The provider's cache is shared across your organization's traffic. Your tenants share that cache with each other whether you want them to or not, and a single tenant whose prefix shape shifts overnight can evict the prefixes everyone else's unit economics depended on. The bill spikes for tenants who did nothing differently. Finance pages engineering. Engineering points at the dashboard, which shows nothing wrong, because the dashboard isn't measuring the thing that broke.
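A minimal sketch of the missing alarm, assuming your provider reports cached prompt tokens in its usage metadata (field names vary by vendor, so `cached_tokens` here is an assumption) and that you tag each request with a tenant ID:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RequestUsage:
    tenant_id: str
    prompt_tokens: int      # total prompt tokens billed for the request
    cached_tokens: int      # prompt tokens served from the provider's prefix cache

def per_tenant_hit_rate(usages: list[RequestUsage]) -> dict[str, float]:
    """Cache hit rate = cached prompt tokens / total prompt tokens, per tenant."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for u in usages:
        totals[u.tenant_id][0] += u.cached_tokens
        totals[u.tenant_id][1] += u.prompt_tokens
    return {t: (hit / total if total else 0.0) for t, (hit, total) in totals.items()}

def hit_rate_alarms(current: dict[str, float],
                    baseline: dict[str, float],
                    max_drop: float = 0.20) -> list[str]:
    """Tenants whose hit rate fell more than `max_drop` below their own baseline."""
    return [t for t, rate in current.items()
            if t in baseline and baseline[t] - rate > max_drop]
```

The point of bucketing by tenant is that the alarm names the tenant whose prefix shape changed, not the tenants quietly paying for the eviction.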

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.
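A minimal sketch of where the budget goes, with hypothetical stage latencies standing in for real classifier calls: run the pipeline serially and every guard is a full round trip; run the pre-generation guards concurrently and the slowest one, not the sum, sets the cost.

```python
import asyncio

# Hypothetical per-stage latencies (seconds); stand-ins for real classifier calls.
async def input_classifier(prompt: str) -> bool:
    await asyncio.sleep(0.30); return True          # prompt-injection check

async def jailbreak_filter(prompt: str) -> bool:
    await asyncio.sleep(0.35); return True          # adversarial-pattern scan

async def generate(prompt: str) -> str:
    await asyncio.sleep(1.40); return "answer"      # the generative call itself

async def output_moderation(text: str) -> bool:
    await asyncio.sleep(0.30); return True

async def serial_pipeline(prompt: str) -> str:
    # Each guard blocks the next: roughly 0.30 + 0.35 + 1.40 + 0.30 = 2.35s here.
    await input_classifier(prompt)
    await jailbreak_filter(prompt)
    answer = await generate(prompt)
    await output_moderation(answer)
    return answer

async def parallel_pre_checks_pipeline(prompt: str) -> str:
    # Pre-generation guards run concurrently; the slowest one sets the cost: ~2.05s here.
    ok_input, ok_jailbreak = await asyncio.gather(
        input_classifier(prompt), jailbreak_filter(prompt))
    if not (ok_input and ok_jailbreak):
        return "refused"
    answer = await generate(prompt)
    await output_moderation(answer)
    return answer

# asyncio.run(serial_pipeline("..."))  # compare wall-clock time of the two variants
```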

The Reranker Is the Silent Second Model Your RAG Eval Never Measures

· 10 min read
Tian Pan
Software Engineer

A typical RAG pipeline ships with two models, not one. The retriever pulls 50 to 100 candidates from the vector store, and a reranker — a cross-encoder, an LLM-as-judge prompt, or a hybrid — re-scores those candidates and hands the top 5 to the answer model. Your eval suite measures end-to-end answer quality. It measures retriever recall@k. It does not measure the reranker. So when the reranker quietly drifts, the dashboard renders "answer quality dropped 4 points" with no causal arrow, and the team spends three days debugging a prompt that is not the problem.

The reranker is the silent second model. It sits between the retriever and the generator, it has its own scoring distribution, its own prompt (if it's LLM-based) or its own weights (if it's a cross-encoder), and it can regress independently of every other component. Most teams never grade it in isolation. The eval suite they wrote treats the pipeline like one model with a long context window, when it's actually two models in series with an interface neither team owns.
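A minimal sketch of grading the reranker in isolation, assuming you have labeled relevant passages for a query set: compare recall@5 of the raw retriever order against the reranked order, so a reranker regression shows up as its own number instead of an unexplained drop in end-to-end answer quality. The function names are illustrative.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of labeled-relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def reranker_delta(retriever_order: list[str],
                   reranker_order: list[str],
                   relevant_ids: set[str],
                   k: int = 5) -> float:
    """How much the reranker helps (or hurts) versus passing the retriever order through.

    A delta that was positive last month and is negative this month is a reranker
    regression, independent of the answer model and the prompt.
    """
    return (recall_at_k(reranker_order, relevant_ids, k)
            - recall_at_k(retriever_order, relevant_ids, k))
```

Tracking the delta per eval run gives the reranker its own trend line, which is the thing most pipelines never have.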

Retries Aren't Free: The FinOps Math of LLM Retry Policies

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter found a $4,200 line item on their inference invoice that nobody could explain. The dashboard showed normal traffic. The latency graphs were flat. The cause turned out to be a single agent stuck in a polite retry loop for six hours, replaying a 40k-token tool chain with exponential backoff that capped out at thirty seconds and then started over. The retry policy was lifted verbatim from an internal SRE handbook written in 2019 for a JSON-over-HTTP service. It worked perfectly. It worked perfectly for the wrong system.

This is the bill that does not show up in capacity-planning spreadsheets. The retry-policy patterns the industry standardized on for stateless REST APIs assume three things that LLM workloads quietly violate: failures are transient, the cost of one extra attempt is bounded, and a retry has a meaningful chance of succeeding. Each assumption was load-bearing. Each one is now wrong, and the variance the cost model never captured is sitting at the bottom of every monthly invoice.

The teams that have not rebuilt their retry policy for token economics are paying a hidden tax that scales with the difficulty of the queries they were already most worried about — the long ones, the agentic ones, the ones with deep tool chains. The retry budget that classical resilience engineering hands you back as a safety net is, in an LLM stack, the rope.
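A minimal sketch of a retry policy rebuilt around token economics rather than attempt counts. The `call` and `estimate_tokens` callables are assumptions about your stack, not a real client API; the point is that the loop stops when cumulative spend crosses a cap, not when an attempt counter runs out.

```python
import time

class TokenBudgetExceeded(Exception):
    pass

def retry_with_token_budget(call, estimate_tokens, max_tokens: int = 120_000,
                            max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an LLM call, but charge every attempt against a token budget.

    `call()` performs the request and raises on failure; `estimate_tokens()`
    returns the expected prompt + completion tokens of one attempt.
    """
    spent = 0
    for attempt in range(max_attempts):
        cost = estimate_tokens()
        if spent + cost > max_tokens:
            # A classical policy would keep going; here the budget is the backstop.
            raise TokenBudgetExceeded(f"retry budget exhausted after {spent} tokens")
        spent += cost
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # backoff still applies; tokens are the real cap
```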

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
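A minimal sketch of the fix on the measurement side: keep the wall-clock hour on every eval record and report per-hour buckets instead of one averaged number, so the 2 AM figure and the 8 AM figure stop hiding each other. The record fields here are assumptions about what your harness logs.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median, quantiles

@dataclass
class EvalRun:
    hour_utc: int        # wall-clock hour the prompts were actually sent
    latency_s: float     # end-to-end latency of the request
    judge_score: float   # score the judge gave the response

def by_hour(runs: list[EvalRun]) -> dict[int, dict[str, float]]:
    """Median judge score and p95 latency per wall-clock hour."""
    buckets: dict[int, list[EvalRun]] = defaultdict(list)
    for r in runs:
        buckets[r.hour_utc].append(r)
    report = {}
    for hour, rs in sorted(buckets.items()):
        lat = [r.latency_s for r in rs]
        p95 = quantiles(lat, n=20)[-1] if len(lat) >= 2 else lat[0]
        report[hour] = {"median_judge": median(r.judge_score for r in rs),
                        "p95_latency_s": p95}
    return report
```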

The Structured-Output Retry Loop Is Your Hidden Compute Waste

· 11 min read
Tian Pan
Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third pass finally validates, but by then the request has burned four full inference calls counting the original generation, and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.
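As a preview, here is a minimal sketch of a validate-and-retry wrapper that returns the cost of the whole loop alongside the result, so the loop shows up as a cost distribution instead of vanishing into the per-request meter. `generate_json` and `validate` are stand-ins for your own generation call and schema validator.

```python
import json
from dataclasses import dataclass, field

@dataclass
class LoopCost:
    attempts: int = 0
    total_tokens: int = 0
    errors: list[str] = field(default_factory=list)

def generate_with_cost_tracking(generate_json, validate, max_attempts: int = 3):
    """Retry until the output validates, but surface the cost of the whole loop.

    Assumed interface: `generate_json(feedback)` returns (text, tokens_used);
    `validate(obj)` raises ValueError with a message on schema violations.
    """
    cost = LoopCost()
    feedback = None
    for _ in range(max_attempts):
        text, tokens = generate_json(feedback)
        cost.attempts += 1
        cost.total_tokens += tokens
        try:
            obj = json.loads(text)
            validate(obj)
            return obj, cost          # success: log a histogram of cost.attempts and cost.total_tokens
        except (json.JSONDecodeError, ValueError) as e:
            cost.errors.append(str(e))
            feedback = f"The previous output was invalid: {e}. Return valid JSON only."
    return None, cost                 # exhausted: this is the loop you never priced
```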

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.
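A minimal sketch of the bucketed computation the dashboard would need. The joules-per-token figures below are placeholder assumptions, not measured or published numbers; substitute your own measurements or vendor estimates. The shape is the point: energy attribution per request, bucketed by model tier and cache behavior, with only task-completing tokens counted as useful.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative energy intensities (joules per token) by model tier. Placeholders only.
PREFILL_J_PER_TOKEN = {"small": 0.01, "medium": 0.08, "large": 0.3}
DECODE_J_PER_TOKEN  = {"small": 0.05, "medium": 0.4,  "large": 1.5}

@dataclass
class TaskRecord:
    model_tier: str
    prompt_tokens: int
    cached_tokens: int      # prompt tokens served from the prefix cache (near-zero prefill cost)
    output_tokens: int
    task_completed: bool    # did this request actually finish the user's task?

def useful_tokens_per_kwh(records: list[TaskRecord]) -> dict[str, float]:
    """Useful output tokens per kWh, bucketed by model tier.

    Collapsing the buckets into one number reproduces the single-GWh line item
    the post describes; keeping them separate is the whole point.
    """
    acc: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0])  # [useful tokens, joules]
    for r in records:
        fresh_prompt = max(r.prompt_tokens - r.cached_tokens, 0)
        joules = (fresh_prompt * PREFILL_J_PER_TOKEN[r.model_tier]
                  + r.output_tokens * DECODE_J_PER_TOKEN[r.model_tier])
        acc[r.model_tier][0] += r.output_tokens if r.task_completed else 0
        acc[r.model_tier][1] += joules
    return {tier: (tok / (j / 3.6e6)) if j else 0.0 for tier, (tok, j) in acc.items()}
```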

Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth

· 9 min read
Tian Pan
Software Engineer

A team I know spent three weeks chasing a "context truncation" bug that only fired in production for Japanese customers. Their CI fixtures were English. Their tiktoken count said the prompt fit in 8K with a 600-token margin. The provider's invoice said the request had been rejected for exceeding the limit. The two numbers were off by 11%, the safety margin lived inside that 11%, and nobody had ever measured the disagreement on CJK text. The fix wasn't a new model — it was throwing away the local counter as a source of truth.

That's the subtle, expensive shape of tokenizer drift: not a single wrong number, but a class of small systematic errors that accumulate at the boundaries you forgot to test. The local counter in your IDE, the budget calculator in your gateway, the rate-limit estimator in your retry middleware, and the authoritative count the provider charges against — none of these agree, and the gap widens exactly where your users live.
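A minimal sketch of measuring that gap instead of assuming it is zero: compare the local tiktoken count against the provider's billed count per language bucket, and size your safety margin from the worst drift you actually observe. The `cl100k_base` encoding and the shape of the samples are assumptions; use whatever your model and billing data provide.

```python
from collections import defaultdict
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # assumption: adjust to your model's encoding

def local_count(text: str) -> int:
    return len(enc.encode(text))

def worst_drift_by_language(samples: list[tuple[str, str, int]]) -> dict[str, float]:
    """samples: (language_tag, prompt_text, provider_reported_prompt_tokens).

    Returns the worst-case relative underestimate per language bucket, i.e. how
    much larger the billed count was than the local counter claimed.
    """
    worst: dict[str, float] = defaultdict(float)
    for lang, text, billed in samples:
        local = local_count(text)
        if local:
            worst[lang] = max(worst[lang], (billed - local) / local)
    return dict(worst)

# Reserve local_count(prompt) * (1 + worst["ja"]) context for Japanese traffic,
# rather than trusting the local number at face value.
```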

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

· 11 min read
Tian Pan
Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.
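A minimal sketch of the guard most frameworks are missing: an explicit call chain per agent run that refuses a dispatch which would re-enter a tool already on the chain, or exceed a depth cap. The dispatcher interface in the usage comment is hypothetical.

```python
class ReentrancyError(RuntimeError):
    pass

class ToolCallGuard:
    """Tracks the active tool-call chain for one agent run.

    Push before dispatching a tool, pop when it returns. A tool already on the
    chain is a cycle (A -> B -> A); a chain deeper than `max_depth` is runaway
    nesting even without a cycle.
    """
    def __init__(self, max_depth: int = 8):
        self.chain: list[str] = []
        self.max_depth = max_depth

    def enter(self, tool_name: str) -> None:
        if tool_name in self.chain:
            cycle = " -> ".join(self.chain + [tool_name])
            raise ReentrancyError(f"re-entrant tool call: {cycle}")
        if len(self.chain) >= self.max_depth:
            raise ReentrancyError(f"tool chain deeper than {self.max_depth}: {self.chain}")
        self.chain.append(tool_name)

    def exit(self, tool_name: str) -> None:
        assert self.chain and self.chain[-1] == tool_name
        self.chain.pop()

# Hypothetical usage inside a dispatcher:
#   guard.enter("search")
#   try:
#       result = tools["search"](args)
#   finally:
#       guard.exit("search")
```

The same chain, logged alongside the trace, also turns the flat list of sibling calls back into the nested graph a human can read.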

The Agent Flight Recorder: Capture These Fields Before Your First Incident

· 12 min read
Tian Pan
Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside — not retrofitted.
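As a starting point, a minimal sketch of the per-step record, following the fields the stories above call out: the resolved model version rather than the rolling alias, a snapshot of the system prompt, the sampling parameters even when they are "just the default," and tool results stored whole or by reference instead of truncated. The names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class AgentStepRecord:
    """One model turn or tool call, captured before the first incident, not after."""
    trace_id: str                      # ties every step of one task together
    step_index: int
    timestamp_utc: str
    model_version: str                 # the resolved version, not the rolling alias
    system_prompt_sha256: str          # hash of the snapshotted system prompt
    temperature: float                 # logged even when it is the framework default
    top_p: float
    conversation: list[str]            # the full conversation so far, not only the latest turn
    tool_name: Optional[str] = None
    tool_arguments: dict[str, Any] = field(default_factory=dict)   # never truncated
    tool_result_ref: Optional[str] = None   # pointer to the untruncated result in blob storage
    output_text: Optional[str] = None
    prompt_tokens: int = 0
    completion_tokens: int = 0
```

Oversized tool results go to blob storage and the record keeps the reference; the one thing the recorder must never do is the thing the log line did, which is cut the evidence at the exact point it became interesting.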