348 posts tagged with "observability"

LLM-as-Judge Drift: When Your Evaluator Upgrades and All Your Numbers Move

April 23, 2026 · 11 min read

Software Engineer

A regression suite that flips green-to-red without a single prompt change is usually one of three things: a broken test harness, a flaky retrieval store, or a judge that learned new taste over the weekend. The third one is the most common and the least debugged, because no commit in your repo caused it. The scoring model got a silent quality refresh, and every score you compare against last month's dashboard is now denominated in a different currency.

This is the uncomfortable part of LLM-as-judge: you have two moving models, not one. The candidate model is the thing you ship; the judge model is the thing that tells you how the candidate is doing. When both evolve independently, score deltas stop meaning what they used to, and the dashboard that your PM refreshes every morning quietly lies.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

April 23, 2026 · 8 min read

Tian Pan

Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.

The Model Bill Is 30% of Your Inference Cost

April 23, 2026 · 8 min read

Tian Pan

Software Engineer

A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.

This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system — RAG, agents, anything with tool use or evaluation gates — the model inference is often 30 to 50% of the real unit economics. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."

Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode plan-and-execute architectures bury under their own architectural elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.

Rate Limit Hierarchy Collapse: When Your Agent Loop DoSes Itself

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

The bug report says the service is slow. The dashboard says the service is healthy. Token-per-minute usage is at 62% of the tier cap, well inside the green band. Then you open the traces and see the shape: one user request spawned a planner step, which emitted eleven parallel tool calls, four of which were search fan-outs that each triggered sub-agents, which each called three tools in parallel — and that single "request" is now pounding your own token bucket from forty-seven different workers at the same time. The other ninety-nine users of your product are stuck behind it, getting 429s they never earned. Your agent is DoSing itself, and the rate limiter is doing exactly what you told it to.

This is rate limit hierarchy collapse. You bought a perimeter defense designed for HTTP APIs where one request equals one unit of work, then wired it in front of a system where one request means a tree of unknown depth and unbounded branching factor. The single-bucket model doesn't just fail to protect — it fails invisibly, because your aggregate numbers never breach anything. The damage happens in the tails, in correlated bursts, and in the heads-down users who happen to be adjacent in time to a heavy one.

Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure

April 23, 2026 · 13 min read

Tian Pan

Software Engineer

The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.

Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.

This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.

Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% retains the median request a hundred times before it keeps a single rare catastrophic trajectory — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.

This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.

The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.

Token Spend Is a Security Signal Your SOC Isn't Watching

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.

The AI Audit Trail Is a Product Feature, Not a Compliance Checkbox

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

McKinsey's 2025 survey found that 75% of business leaders were using generative AI in some form — but nearly half had already experienced a significant negative consequence. That gap is not a model quality problem. It's a trust problem. And the fastest path to closing it is not more evals, better prompts, or a new frontier model. It's showing users exactly what the agent did.

Most engineering teams treat the audit trail as an afterthought — something you wire up for GDPR compliance or SOC 2, then lock in an internal dashboard that only ops reads. That's the wrong frame. When users can see which tool the agent called, what data it retrieved, and which reasoning branch produced the answer, three things happen: adoption goes up, support escalations go down, and model errors surface days earlier than they would from any backend alert.

The AI Feature Lifecycle Decay Problem: How to Catch Degradation Before Users Do

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped clean. The demo impressed, the launch metrics looked great, and the model benchmarked at 88% accuracy on your test set. Then, about three months later, a customer success manager forwards you a screenshot. The AI recommendation made no sense. You pull the logs, run a quick evaluation, and find accuracy has drifted to 71%. No alert fired. No error was thrown. Infrastructure dashboards showed green the whole time.

This pattern is not a freak occurrence. Research across 32 production datasets found that 91% of ML models degrade over time — and most of the degradation is silent. The systems keep running, the code doesn't change, but the predictions get progressively worse as the real world moves on without the model.

About Tian Pan