330 posts tagged with "observability"

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

April 28, 2026 · 10 min read

Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.

The Reranker Is the Silent Second Model Your RAG Eval Never Measures

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

A typical RAG pipeline ships with two models, not one. The retriever pulls 50 to 100 candidates from the vector store, and a reranker — a cross-encoder, an LLM-as-judge prompt, or a hybrid — re-scores those candidates and hands the top 5 to the answer model. Your eval suite measures end-to-end answer quality. It measures retriever recall@k. It does not measure the reranker. So when the reranker quietly drifts, the dashboard renders "answer quality dropped 4 points" with no causal arrow, and the team spends three days debugging a prompt that is not the problem.

The reranker is the silent second model. It sits between the retriever and the generator, it has its own scoring distribution, its own prompt (if it's LLM-based) or its own weights (if it's a cross-encoder), and it can regress independently of every other component. Most teams never grade it in isolation. The eval suite they wrote treats the pipeline like one model with a long context window, when it's actually two models in series with an interface neither team owns.

Retries Aren't Free: The FinOps Math of LLM Retry Policies

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

A team I talked to last quarter found a $4,200 line item on their inference invoice that nobody could explain. The dashboard showed normal traffic. The latency graphs were flat. The cause turned out to be a single agent stuck in a polite retry loop for six hours, replaying a 40k-token tool chain with exponential backoff that capped out at thirty seconds and then started over. The retry policy was lifted verbatim from an internal SRE handbook written in 2019 for a JSON-over-HTTP service. It worked perfectly. It worked perfectly for the wrong system.

This is the bill that does not show up in capacity-planning spreadsheets. The retry-policy patterns the industry standardized on for stateless REST APIs assume three things that LLM workloads quietly violate: failures are transient, the cost of one extra attempt is bounded, and a retry has a meaningful chance of succeeding. Each assumption was load-bearing. Each one is now wrong, and the variance the cost model never captured is sitting at the bottom of every monthly invoice.

The teams that have not rebuilt their retry policy for token economics are paying a hidden tax that scales with the difficulty of the queries they were already most worried about — the long ones, the agentic ones, the ones with deep tool chains. The retry budget that classical resilience engineering hands you back as a safety net is, in an LLM stack, the rope.

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.

The Structured-Output Retry Loop Is Your Hidden Compute Waste

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third pass finally validates, but by then the request has burned four full inference calls plus the original generation, and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.

Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth

April 28, 2026 · 9 min read

Tian Pan

Software Engineer

A team I know spent three weeks chasing a "context truncation" bug that only fired in production for Japanese customers. Their CI fixtures were English. Their tiktoken count said the prompt fit in 8K with a 600-token margin. The provider's invoice said the request had been rejected for exceeding the limit. The two numbers were off by 11%, the safety margin lived inside that 11%, and nobody had ever measured the disagreement on CJK text. The fix wasn't a new model — it was throwing away the local counter as a source of truth.

That's the subtle, expensive shape of tokenizer drift: not a single wrong number, but a class of small systematic errors that accumulate at the boundaries you forgot to test. The local counter in your IDE, the budget calculator in your gateway, the rate-limit estimator in your retry middleware, and the authoritative count the provider charges against — none of these agree, and the gap widens exactly where your users live.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.

The Agent Flight Recorder: Capture These Fields Before Your First Incident

April 27, 2026 · 13 min read

Tian Pan

Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside — not retrofitted.

Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.

Where the 30 Seconds Went: Latency Attribution Inside an Agent Step Your APM Can't See

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The dashboard says agent.run = 28s at p95. Users say the feature feels broken. The on-call engineer opens the trace, sees a single fat bar with no children worth investigating, and starts guessing. By the time someone has rebuilt enough mental model to know whether the bottleneck is the model, the retriever, or a tool call that nobody added a span to, the incident has aged into a backlog ticket and the user has given up.

This is the failure mode at the heart of agent operations in 2026: classical APM treats an agent step as a black box, and "agent latency" is not a metric — it is the sum of seven metrics that decompose the wall-clock time differently depending on what the agent decided to do that turn. A team that doesn't expose those seven numbers ships a feature whose slowness everyone can feel and nobody can fix.

Agent Traffic Is Not Human Traffic: Designing APIs for Two Species of Caller

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The API you shipped two years ago was designed for a single species of caller: a person, behind a browser or a mobile client, clicking once and waiting for a response. That assumption is now wrong on roughly half of every interesting endpoint. The other half of the traffic is agents — your own, your customers', third-party integrations using your endpoints as tools — and they have different physics. They burst. They retry forever. They parallelize. They parse error strings literally. They act on behalf of a human who will not be available to clarify intent when something breaks.

Most of the production weirdness landing in postmortems this year traces back to one architectural mistake: treating both species as the same caller class. Rate limits sized for human pacing get blown apart by an agent's parallel fanout. Error messages designed to be human-readable get parsed wrong by an agent that retries forever on a 400. Idempotency assumptions that humans satisfy by default get violated when an agent retries the same payload from a recovered checkpoint. Auth logs lose the ability to distinguish "the user did this" from "the user's agent did this on the user's behalf."

The fix is not a smarter WAF or a bigger rate-limit bucket. It is a deliberate API design that names two caller classes, treats their traffic as different shapes, and records the delegation chain so accountability survives the indirection.

About Tian Pan