330 posts tagged with "observability"

The Tokenizer Upgrade That Invalidated Every Prompt Cache Prefix

· 9 min read
Tian Pan
Software Engineer

The release notes were two lines long. "Improved multilingual tokenization. No breaking changes to model outputs." Nine words. Your evals confirmed it: same prompts, same completions, same scores. Your platform team signed off on the upgrade Friday afternoon. By Tuesday morning your cache hit rate had collapsed from 80% to 4%, your daily inference bill had quadrupled, and the on-call engineer who paged you at 6am could not find a single line of your code that had changed.

Nothing in your code had changed. The provider had shipped a new tokenizer that split one Unicode glyph one byte differently than the old one. Every cached prefix in your system was now fingerprinted against a token sequence that no longer existed. The model behaved identically — that was true. The cache layer, which the release notes did not mention, paid the bill in full.
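
To make the fingerprint mismatch concrete, here is a minimal sketch. Both tokenizers are toy stand-ins invented for illustration (one token per UTF-8 byte versus one token per code point), and the hashing scheme is an assumption about how a token-keyed prefix cache might fingerprint its entries, not a description of any provider's cache.

```python
import hashlib


def tokenize_old(text: str) -> list[int]:
    """Hypothetical old tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def tokenize_new(text: str) -> list[int]:
    """Hypothetical new tokenizer: one token per Unicode code point."""
    return [ord(ch) for ch in text]


def prefix_fingerprint(token_ids: list[int]) -> str:
    """Hash the token sequence, as a token-keyed prefix cache might."""
    digest = hashlib.sha256()
    for token_id in token_ids:
        digest.update(token_id.to_bytes(4, "big"))
    return digest.hexdigest()


def fingerprint_survives_upgrade(prefix: str) -> bool:
    """True if the cached prefix would still match after the tokenizer swap."""
    old = prefix_fingerprint(tokenize_old(prefix))
    new = prefix_fingerprint(tokenize_new(prefix))
    return old == new


ascii_prefix = "You are a helpful assistant."
glyph_prefix = "You are a helpful assistant at Caf\u00e9 Nord."

ascii_survives = fingerprint_survives_upgrade(ascii_prefix)
glyph_survives = fingerprint_survives_upgrade(glyph_prefix)
```

In this toy setup the ASCII-only prefix keeps its fingerprint and the prefix containing a single accented glyph does not, even though the text itself is identical before and after the swap.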

The Tool Schema Migration That Broke Your Agent's Retries for Two Weeks

· 11 min read
Tian Pan
Software Engineer

The deprecation notice went out on a Tuesday. The downstream team changed the response shape on their search tool: results[].snippet became results[].excerpt, a clean rename with a six-week window, a banner in the docs, and three reminder emails to the engineering list. Every human consumer migrated. The agent did not, because the agent does not read email. For fourteen days the retry loop quietly parsed the new payload, found the field it was looking for missing, raised a KeyError, and counted that as a retryable failure. The retry hit the same endpoint, got the same new shape, raised the same error, gave up after three attempts, and returned an apologetic message to the user. The retry budget dashboard stayed green the entire time: the budget was never exhausted, because every request simply failed within it. Success rate, measured at the tool layer, sat at zero on that path. Nobody looked because there was no page.

This is the shape of the failure that gets the most engineers in 2026: not the dramatic outage, but the silent contract drift where a human-facing migration runs to completion and the agent-facing one never starts because nobody knew there was one to run. The deprecation worked exactly as designed for the consumers it was designed for. The agent was a consumer nobody listed.
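
One way to keep this kind of drift from hiding inside a retry loop is to separate transient transport errors from contract errors. The sketch below is illustrative only: SchemaDriftError, extract_snippets, and search_with_retries are hypothetical names, and the payload shape is taken from the rename described above.

```python
class SchemaDriftError(Exception):
    """The payload parsed, but a field the agent depends on is missing."""


def extract_snippets(payload: dict) -> list[str]:
    """Read results[].snippet; surface a missing field as contract drift."""
    try:
        return [result["snippet"] for result in payload["results"]]
    except KeyError as exc:
        raise SchemaDriftError(f"missing field {exc}") from exc


def search_with_retries(fetch, max_attempts: int = 3) -> list[str]:
    """Retry transient transport errors only; never retry a schema mismatch."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            payload = fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            continue
        # A SchemaDriftError raised here propagates immediately: calling the
        # same endpoint again would only return the same new shape.
        return extract_snippets(payload)
    raise AssertionError("unreachable")
```

With this split, the renamed field fails on the first attempt with an error type that can be alerted on separately, instead of being counted as three ordinary retries.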

The traceparent header your gateway dropped between LLM call and tool execution

· 11 min read
Tian Pan
Software Engineer

A user reports that the agent answered correctly but the database update never happened. You open your observability tool, search for the trace ID stamped on the user-facing conversation, and find a clean tree — five LLM calls, four tool decisions, a final response. No errors. Then you search for the tool service that owns the database write, and you find another trace, with the same wall-clock window but a different trace ID, a different root span, and no link back. You search the gateway logs. Three more orphan traces. The agent run that looked like a single coherent interaction in the chat UI fragmented, in your tracing backend, into a forest.

The header that should have stitched it together is traceparent. It is a 55-byte W3C-standard string that every span in a distributed system uses to identify its parent. It is also, in most production LLM agent stacks, dropped at least once between the user's request and the side effect the user actually wanted.
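
For reference, a version-00 traceparent value has four lowercase-hex fields joined by dashes: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags, which adds up to the 55 bytes mentioned above. The sketch below parses that format and builds the header a hop would forward; the function names are hypothetical, and it handles version 00 only.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is absent or invalid."""
    if header is None:
        return None
    match = TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, flags)


def child_traceparent(ctx: TraceContext) -> str:
    """Build the outgoing header: same trace ID, fresh span ID as the parent."""
    span_id = secrets.token_hex(8)
    while span_id == "0" * 16:
        span_id = secrets.token_hex(8)
    return f"00-{ctx.trace_id}-{span_id}-{ctx.flags}"
```

A hop that receives a header parse_traceparent rejects, or no header at all, has nothing to continue and will typically start a new root trace, which is how one agent run ends up as several unlinked trees.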

The Transcription Confidence Score Your Agent Trusted After the Vendor's Recalibration

· 10 min read
Tian Pan
Software Engineer

The voice agent had a gate. Anything above 0.85 transcription confidence went straight to the planning step; anything below got routed to a human. The threshold had been tuned six months earlier against a labeled corpus of real customer calls, frozen into a config file, and forgotten. For six months it did exactly what it was supposed to do. Then the transcription provider shipped a model upgrade — same API, same response shape, same latency band, same documented accuracy — and over the next two weeks the agent started authorizing wire transfers to the wrong people.

"Transfer $50 to mom" became "transfer $5,000 to Tom." The new transcript came back with a confidence of 0.91, well above the gate. The downstream planner saw a confident transcript and acted on it. The customer's appeal eventually surfaced the bug, but by then the support queue had filtered out a week's worth of similar incidents as fraud disputes. The post-mortem traced the gap to a single decision the team had never made explicitly: that 0.85 from the old model and 0.85 from the new model were the same number.
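
A minimal sketch of making that decision explicit: key the threshold on the model version it was calibrated against, and fail closed for any version that has not been calibrated. The version labels and the assumption that the provider exposes a model version at all are hypothetical; only the 0.85 gate and the 0.91 score come from the incident described above.

```python
from typing import Optional

# Thresholds are only meaningful for the model they were tuned against.
# "asr-v1" is a hypothetical label for the model the 0.85 gate was tuned on.
CALIBRATED_THRESHOLDS = {"asr-v1": 0.85}


def route(confidence: float, model_version: Optional[str]) -> str:
    """Send a transcript to the planner only under a calibrated threshold."""
    threshold = CALIBRATED_THRESHOLDS.get(model_version)
    if threshold is None:
        # Unknown or uncalibrated model: the score has no agreed meaning yet.
        return "human"
    return "planner" if confidence > threshold else "human"


old_model_route = route(0.91, "asr-v1")
new_model_route = route(0.91, "asr-v2")
```

Under this sketch, the 0.91 transcript from the upgraded model goes to a human until someone re-tunes the gate against labeled calls and adds an entry for the new version.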

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that the average AI-routed ticket now took seven turns to handle, up from four. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.
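
The arithmetic is easy to reproduce. The sketch below uses the figures quoted above (cost-per-call down 25%, turns per ticket up from four to seven) plus three assumptions that are not in the source: one API call per turn, a starting price of $0.04 per call, and a resolution rate that slipped from 80% to 75%. Under those assumptions the numbers line up with the 40% increase.

```python
def cost_per_resolved_task(
    cost_per_call: float, calls_per_ticket: float, resolution_rate: float
) -> float:
    """Spend per ticket, divided by the fraction of tickets actually resolved."""
    return cost_per_call * calls_per_ticket / resolution_rate


# Assumed baseline: $0.04 per call, 4 calls per ticket, 80% resolved.
before = cost_per_resolved_task(0.04, 4, 0.80)

# After: 25% cheaper calls, 7 calls per ticket, assumed 75% resolved.
after = cost_per_resolved_task(0.04 * 0.75, 7, 0.75)

increase = after / before - 1.0
print(f"before=${before:.2f} after=${after:.2f} change={increase:+.0%}")
```

The per-call price is the only term that improved; the other two terms in the formula moved the wrong way and outweighed it.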

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order's status was actually pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3, because it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.
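
One way to express the missing guard is to make each branch name the evidence it depends on and refuse to decide when that evidence is no longer in context. This is a sketch under assumptions: the context is modeled as a plain dict, and the key step1.get_order and the branch names are hypothetical.

```python
class MissingEvidenceError(Exception):
    """A branch condition refers to a tool result that is not in context."""


def choose_branch(context: dict, evidence_key: str = "step1.get_order") -> str:
    """Branch on order status only if the step-1 result is still present."""
    order = context.get(evidence_key)
    if order is None:
        # Do not guess: re-fetch the order or fail loudly instead.
        raise MissingEvidenceError(f"{evidence_key} was pruned from context")
    if order["status"] == "shipped":
        return "send_tracking_email"
    return "open_refund_ticket"


intact_context = {"step1.get_order": {"status": "pending"}}
pruned_context = {"step2.transcript": "twelve kilobytes of transcript text"}

intact_choice = choose_branch(intact_context)
```

With the step-1 result present, the pending order routes to the refund branch; with it pruned, the call raises instead of silently picking a branch.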

The Agent Runbook Your Incident Commander Could Not Execute

· 10 min read
Tian Pan
Software Engineer

The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.

The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.

The Annotation Queue Your Humans Quietly Stopped Reading

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.

This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
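
If the review budget is fixed, the ordering function can at least be chosen on purpose. A minimal sketch: size a uniform random sample to the time available, rather than letting annotators work down from the top of the list. The 800 traces and ninety minutes come from the scenario above; the three minutes per trace and the function name are assumptions.

```python
import random


def sample_for_review(
    trace_ids: list[str],
    minutes_budget: float,
    minutes_per_trace: float,
    seed: int,
) -> list[str]:
    """Pick a uniform random sample that fits the annotators' time budget."""
    capacity = int(minutes_budget // minutes_per_trace)
    sample_size = min(capacity, len(trace_ids))
    # A seeded generator keeps the weekly sample reproducible for audits.
    return random.Random(seed).sample(trace_ids, sample_size)


weekly_traces = [f"trace-{i}" for i in range(800)]
review_queue = sample_for_review(
    weekly_traces, minutes_budget=90, minutes_per_trace=3, seed=7
)
```

This yields 30 traces drawn uniformly from the week, so the graded score estimates the whole population instead of whatever happened to sort first.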

The Budget Cap That Fires After the Action Already Shipped

· 9 min read
Tian Pan
Software Engineer

A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly — the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for." Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.

This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the spend plane, but the damage lives in the action plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.

The Chain-of-Thought You Stripped to Save Tokens That Hid an Evidence Requirement

· 10 min read
Tian Pan
Software Engineer

A platform team shipped a prompt refactor that cut average response cost by thirty-two percent. The change was simple: strip the "explain your reasoning" preamble, ask the model to return only the JSON object, and drop the post-processing step that parsed the rationale out of the model's prose. The dashboard turned green. The unit economics page in the quarterly review went from yellow to gold. Nobody on the platform team thought to consult the risk team, because no part of the change touched the answer the customer received.

Two quarters later, a regulated customer's auditor requested the decision rationale for a denied-loan letter from a date six months prior. The team pulled the trace. The input was there. The output was there. The reasoning was gone — not because anyone deleted it, but because it had stopped being produced the day the refactor shipped. The customer's compliance program had been operating on the assumption that the rationale was somewhere in the trace store; the platform team had been operating on the assumption that the rationale was nobody's problem because the customer-facing answer was unchanged. Both assumptions were correct in isolation. Together they cost the customer a regulatory finding and the platform team a contract renewal.

The Deterministic Seed Your Eval Suite Set That Your Provider Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

You set seed=42. You set temperature=0. You logged the run, posted the dashboard, signed off on the model swap. The next morning the rerun returned a different number on the same prompts, and the explanation you reached for — "must be sampling noise" — was wrong twice over: there was no sampling, and the noise was structural. The seed left your client, the gateway threw it away, the kernel batched your request next to seventeen unrelated ones, and the floating-point reduction order changed under you. Your "reproducible" benchmark was always within one batch of being a different benchmark.

This failure mode is quiet because every layer in the stack is technically correct. The SDK accepts the seed. The provider documents the seed. The model returns a system_fingerprint. The eval harness logs all three. Nothing 5xx's, nothing warns, nothing protests. The number on the dashboard just shifts, and the team rationalizes the shift as the kind of jitter that always existed — because they have no instrument that can tell them whether they're looking at stochastic decoding or at a backend rotation that invalidated three weeks of comparisons.
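
A small instrument for that question might compare the logged seed, temperature, and system_fingerprint before two runs are ever placed on the same chart. The sketch below assumes each run is logged as a dict with those three keys; the function name and the example fingerprint values are hypothetical. Matching values do not prove the runs are bitwise comparable, since batching effects like the one described above are invisible here, but a mismatch is a clear signal that they are not.

```python
def comparability_issues(run_a: dict, run_b: dict) -> list[str]:
    """List reasons two eval runs should not be compared directly."""
    issues = []
    for key in ("seed", "temperature", "system_fingerprint"):
        a, b = run_a.get(key), run_b.get(key)
        if a is None or b is None:
            issues.append(f"{key} missing from at least one run")
        elif a != b:
            issues.append(f"{key} changed: {a!r} -> {b!r}")
    return issues


monday = {"seed": 42, "temperature": 0, "system_fingerprint": "fp_aaa111"}
tuesday = {"seed": 42, "temperature": 0, "system_fingerprint": "fp_bbb222"}

issues = comparability_issues(monday, tuesday)
```

An empty list means only that nothing logged has changed; a non-empty list is a reason to rerun the baseline before reading anything into the new number.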

The Escalation Path That Routes Back to the Agent

· 10 min read
Tian Pan
Software Engineer

The escalation tool was the safety net. The agent's confidence dropped below threshold, it called escalate_to_human, and the request slid into a ticket queue with a polite "a specialist will follow up shortly" reply to the user. Engineering closed the loop on the launch checklist. The on-call calendar listed humans on the receiving end.

Six months later, an audit walked the path. The escalation tool opened a Zendesk ticket. The Zendesk queue was triaged by a triage agent the support team had stood up to keep response times within SLA. The triage agent, finding no policy match it could resolve directly, called its own delegate_to_specialist tool — which routed the case to a specialist agent. The specialist agent, when uncertain, called escalate_to_human. The trace was a closed circuit. No human had touched any of the escalations the audit sampled. The human-in-the-loop the launch doc described did not exist.

The escalation interface had not failed. It had been honored at every hop. What failed was the assumption that the receiving system was a person.
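
The audit's walk can be automated as a reachability check over the routing graph: follow each hop's escalation target and report whether the path ends at a person, dead-ends, or revisits a hop. The hop names below are hypothetical labels for the systems described above, and the single-target-per-hop model is a simplifying assumption.

```python
HUMAN = "human"


def trace_escalation(routes: dict[str, str], start: str) -> tuple[list[str], str]:
    """Follow escalation hops; return the path and 'human', 'loop', or 'dead_end'."""
    path = [start]
    seen = {start}
    node = start
    while True:
        next_hop = routes.get(node)
        if next_hop is None:
            return path, "dead_end"
        path.append(next_hop)
        if next_hop == HUMAN:
            return path, "human"
        if next_hop in seen:
            return path, "loop"
        seen.add(next_hop)
        node = next_hop


# The routing the audit found: every hop hands off to another automated system.
audited_routes = {
    "frontline_agent": "zendesk_queue",
    "zendesk_queue": "triage_agent",
    "triage_agent": "specialist_agent",
    "specialist_agent": "zendesk_queue",
}

path, outcome = trace_escalation(audited_routes, "frontline_agent")
```

Run against the audited routing, the walk ends back at the ticket queue with outcome "loop"; pointing the specialist agent's escalation at a human on-call rotation changes the outcome to "human".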