299 posts tagged with "observability"

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that AI-routed tickets, which used to close in four turns on average, were now taking seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.
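The two dashboards reconcile once both are expressed per resolved ticket. Here is a minimal sketch of that arithmetic. The dollar amounts and the resolution rates are my assumptions, picked so the figures line up with the 25% and 40% above, and it assumes one API call per turn; only the 25%, the four-to-seven turns, and the 40% come from the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    cost_per_call: float     # dollars per API call (hypothetical values)
    calls_per_ticket: float  # average turns per ticket, assuming one call per turn
    resolution_rate: float   # fraction of tickets actually resolved (hypothetical)

    def cost_per_ticket(self) -> float:
        return self.cost_per_call * self.calls_per_ticket

    def cost_per_resolved_task(self) -> float:
        # Unresolved tickets still cost money, so divide by the resolution rate.
        return self.cost_per_ticket() / self.resolution_rate


before = Period(cost_per_call=0.04, calls_per_ticket=4, resolution_rate=0.80)
after = Period(cost_per_call=0.03, calls_per_ticket=7, resolution_rate=0.75)

call_change = after.cost_per_call / before.cost_per_call - 1
task_change = after.cost_per_resolved_task() / before.cost_per_resolved_task() - 1

print(f"cost per call:          {call_change:+.0%}")
print(f"cost per resolved task: {task_change:+.0%}")
```

Under these assumptions the per-call number falls by a quarter while the per-resolved-task number rises by 40%. The extra turns alone account for roughly 31 points of that; the rest comes from the assumed dip in resolution rate.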

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order's status was actually pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3: it ranked low on recency, and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.
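One way to make the dependency explicit is to have each branch name the tool result it reads, and to refuse to branch when that result is gone. This is a rough sketch, not a description of any particular agent framework: Context, Branch, choose_action, and the step_1 key are all hypothetical names.

```python
from dataclasses import dataclass, field


class MissingEvidenceError(RuntimeError):
    """Raised when a branch depends on a tool result that is no longer in context."""


@dataclass
class Context:
    # Tool results keyed by the step that produced them (hypothetical layout).
    tool_results: dict = field(default_factory=dict)

    def evict(self, step_id: str) -> None:
        self.tool_results.pop(step_id, None)


@dataclass(frozen=True)
class Branch:
    evidence_step: str  # which step's tool result the condition reads
    field_name: str
    expected: str
    if_true: str
    if_false: str


def choose_action(branch: Branch, ctx: Context) -> str:
    result = ctx.tool_results.get(branch.evidence_step)
    if result is None or branch.field_name not in result:
        # Fail loudly instead of letting the model guess the missing value.
        raise MissingEvidenceError(
            f"{branch.evidence_step}.{branch.field_name} is not in context"
        )
    matched = result[branch.field_name] == branch.expected
    return branch.if_true if matched else branch.if_false


branch = Branch("step_1", "status", "shipped", "send_tracking_email", "open_refund_ticket")
ctx = Context({"step_1": {"status": "pending"}})
print(choose_action(branch, ctx))  # open_refund_ticket

ctx.evict("step_1")  # the pruner drops the step-1 result
try:
    choose_action(branch, ctx)
except MissingEvidenceError as err:
    print(f"refusing to branch: {err}")
```

A caller could catch the error and re-run get_order instead of failing; the point of the sketch is only that the branch cannot be taken silently.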

The Agent Runbook Your Incident Commander Could Not Execute

· 10 min read
Tian Pan
Software Engineer

The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.

The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.

The Annotation Queue Your Humans Quietly Stopped Reading

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.

This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
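If the humans can only grade a handful of traces, the least you can do is choose which handful on purpose. A minimal sketch, assuming roughly three minutes of grading per trace (my number, not a measured one) and a hypothetical review_sample helper:

```python
import random


def review_sample(trace_ids, budget_minutes, minutes_per_trace, seed):
    """Draw a uniform random sample sized to what annotators can actually grade."""
    capacity = int(budget_minutes // minutes_per_trace)
    k = min(capacity, len(trace_ids))
    rng = random.Random(seed)  # seeded so the weekly sample is reproducible
    return rng.sample(list(trace_ids), k)


traces = [f"trace-{i}" for i in range(800)]
sample = review_sample(traces, budget_minutes=90, minutes_per_trace=3, seed=7)

coverage = len(sample) / len(traces)
print(f"{len(sample)} of {len(traces)} traces reviewed ({coverage:.2%} coverage)")
```

Publishing the coverage figure next to the score makes the shrunken denominator visible. A uniform sample is the simplest choice here; stratifying by feature or failure type is a reasonable next step, but that is outside this sketch.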

The Budget Cap That Fires After the Action Already Shipped

· 9 min read
Tian Pan
Software Engineer

A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly: the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for?" Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.

This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the spend plane, but the damage lives in the action plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.
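One possible shape for a shared boundary is to stage side effects during the run and commit them only if the run finishes inside its budget. The sketch below is an illustration of that idea under my own assumptions, not a fix described above: Run, stage, commit, and abort are hypothetical names, and real actions such as bookings would still need idempotency and compensation logic that this does not cover.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a model call would exceed the remaining token budget."""


class Run:
    def __init__(self, budget_tokens: int) -> None:
        self.remaining = budget_tokens
        self._staged: list[tuple[str, Callable[[], None]]] = []

    def spend(self, tokens: int) -> None:
        # Spend plane: the cap is checked before each model call.
        if tokens > self.remaining:
            raise BudgetExceeded(f"need {tokens}, have {self.remaining}")
        self.remaining -= tokens

    def stage(self, name: str, action: Callable[[], None]) -> None:
        # Action plane: side effects are recorded, not executed.
        self._staged.append((name, action))

    def commit(self) -> list[str]:
        done = []
        for name, action in self._staged:
            action()
            done.append(name)
        self._staged.clear()
        return done

    def abort(self) -> list[str]:
        dropped = [name for name, _ in self._staged]
        self._staged.clear()
        return dropped


executed: list[str] = []
dropped: list[str] = []
run = Run(budget_tokens=1000)
try:
    run.spend(600)
    run.stage("book_flight", lambda: executed.append("book_flight"))
    run.spend(600)  # the cap fires here, before anything was committed
    run.stage("send_confirmation", lambda: executed.append("send_confirmation"))
    run.commit()
except BudgetExceeded:
    dropped = run.abort()

print("executed:", executed)
print("dropped:", dropped)
```

The trade-off is latency: nothing reaches the world until the run ends, which will not suit every workflow.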

The Chain-of-Thought You Stripped to Save Tokens That Hid an Evidence Requirement

· 10 min read
Tian Pan
Software Engineer

A platform team shipped a prompt refactor that cut average response cost by thirty-two percent. The change was simple: strip the "explain your reasoning" preamble, ask the model to return only the JSON object, and drop the post-processing step that parsed the rationale out of the model's prose. The dashboard turned green. The unit economics page in the quarterly review went from yellow to gold. Nobody on the platform team thought to consult the risk team, because no part of the change touched the answer the customer received.

Two quarters later, a regulated customer's auditor requested the decision rationale for a denied-loan letter from a date six months prior. The team pulled the trace. The input was there. The output was there. The reasoning was gone — not because anyone deleted it, but because it had stopped being produced the day the refactor shipped. The customer's compliance program had been operating on the assumption that the rationale was somewhere in the trace store; the platform team had been operating on the assumption that the rationale was nobody's problem because the customer-facing answer was unchanged. Both assumptions were correct in isolation. Together they cost the customer a regulatory finding and the platform team a contract renewal.

The Escalation Path That Routes Back to the Agent

· 10 min read
Tian Pan
Software Engineer

The escalation tool was the safety net. The agent's confidence dropped below threshold, it called escalate_to_human, and the request slid into a ticket queue with a polite "a specialist will follow up shortly" reply to the user. Engineering closed the loop on the launch checklist. The on-call calendar listed humans on the receiving end.

Six months later, an audit walked the path. The escalation tool opened a Zendesk ticket. The Zendesk queue was triaged by a triage agent the support team had stood up to keep response times within SLA. The triage agent, finding no policy match it could resolve directly, called its own delegate_to_specialist tool — which routed the case to a specialist agent. The specialist agent, when uncertain, called escalate_to_human. The trace was a closed circuit. No human had touched any of the escalations the audit sampled. The human-in-the-loop the launch doc described did not exist.

The escalation interface had not failed. It had been honored at every hop. What failed was the assumption that the receiving system was a person.
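The audit's walk can be automated if the routing is written down as a graph. This is a rough sketch under that assumption: the handler names and the next_hop table are hypothetical stand-ins for the path described above, and trace_escalation simply follows each hop until it reaches a human, a dead end, or a handler it has already visited.

```python
def trace_escalation(start, next_hop, humans):
    """Follow escalation hops from start; return (reaches_human, path)."""
    path = [start]
    seen = {start}
    node = start
    while node not in humans:
        nxt = next_hop.get(node)
        if nxt is None:
            return False, path  # dead end: nobody receives the case
        path.append(nxt)
        if nxt in seen:
            return False, path  # cycle: the case routes back to an agent
        seen.add(nxt)
        node = nxt
    return True, path


# Where each handler sends a case it cannot resolve (hypothetical wiring).
next_hop = {
    "frontline_agent": "zendesk_queue",    # escalate_to_human opens a ticket
    "zendesk_queue": "triage_agent",
    "triage_agent": "specialist_agent",    # delegate_to_specialist
    "specialist_agent": "zendesk_queue",   # escalate_to_human again
}
humans = {"support_engineer"}

ok, path = trace_escalation("frontline_agent", next_hop, humans)
print(ok, " -> ".join(path))
```

Running a check like this whenever the routing changes turns "a human is on the receiving end" from a launch-doc claim into something that can fail a build.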

The Eval Harness Whose Judge Model Was Upgraded Silently

· 11 min read
Tian Pan
Software Engineer

A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched — a control set you keep specifically to detect this — and the lift is uniformly distributed, the kind of shape a real product improvement never has. The judge model was rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.

This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.
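A control set gives you a crude version of that instrument. The sketch below treats the lift on control categories as the judge's shift and subtracts it from every other category. That assumes the shift is additive and uniform across categories, which is a strong assumption; the category names and scores are made up for illustration.

```python
from statistics import mean


def decompose_lift(baseline, current, control_categories):
    """Split each category's lift into a judge shift and a residual system lift."""
    lift = {cat: current[cat] - baseline[cat] for cat in baseline}
    # Control categories could not have been affected by the change under test,
    # so any movement there is attributed to the judge.
    judge_shift = mean(lift[cat] for cat in control_categories)
    system_lift = {
        cat: lift[cat] - judge_shift
        for cat in lift
        if cat not in control_categories
    }
    return judge_shift, system_lift


baseline = {"summarization": 70, "extraction": 64, "control": 81}
current = {"summarization": 76, "extraction": 70, "control": 87}

judge_shift, system_lift = decompose_lift(baseline, current, {"control"})
print("judge shift:", judge_shift)
print("system lift:", system_lift)
```

With a uniform six-point lift, the residual attributed to the system is zero in every category. Recording the judge's resolved model version alongside each score is the cheaper guard, where the provider exposes it.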

The Eval Set That Sampled Production Traffic at 3am EST

· 10 min read
Tian Pan
Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.
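Asking "what shape of user is in this dataset" can be a few lines of code. A minimal sketch, assuming each trace carries a segment label and that you know each segment's share of weekly production traffic; the segment names and shares here are hypothetical, chosen so the overnight segments sum to twelve percent.

```python
from collections import Counter


def coverage_gaps(eval_segments, production_share, min_ratio=0.5):
    """Return segments whose eval share is below min_ratio of their production share."""
    counts = Counter(eval_segments)
    total = len(eval_segments)
    gaps = []
    for segment, prod_share in production_share.items():
        eval_share = counts.get(segment, 0) / total
        if eval_share < min_ratio * prod_share:
            gaps.append(segment)
    return sorted(gaps)


# What a 3am sample might contain (hypothetical mix).
eval_segments = (
    ["batch_retry"] * 60 + ["scheduled_report"] * 30 + ["apac_navigation"] * 10
)

# Share of weekly production traffic per segment (hypothetical).
production_share = {
    "batch_retry": 0.05,
    "scheduled_report": 0.05,
    "apac_navigation": 0.02,
    "us_daytime_pricing": 0.88,
}

print(coverage_gaps(eval_segments, production_share))
```

The same check works with revenue-weighted shares in place of traffic shares, which is the weighting that mattered in this story.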

The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they screenshotted from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the pilot mean as its forecasting input has already lost the argument with the bill.
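A small synthetic example shows how few heavy users it takes. The numbers are assumptions for illustration: about 60 tokens for a three-sentence prompt, about 7,500 tokens for a thirty-kilobyte paste (at roughly four bytes per token), and 16 heavy users out of 1,000.

```python
from statistics import mean, median


def top_share(tokens, fraction):
    """Share of total tokens consumed by the heaviest fraction of users."""
    ordered = sorted(tokens, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


pilot = [60] * 50                       # every pilot user types a short prompt
production = [60] * 984 + [7500] * 16   # same typical user, plus a thin heavy tail

pilot_mean = mean(pilot)
prod_mean = mean(production)

print(f"pilot mean:        {pilot_mean:.0f} tokens/user")
print(f"production median: {median(production):.0f} tokens/user")
print(f"production mean:   {prod_mean:.0f} tokens/user")
print(f"bill vs forecast:  {prod_mean / pilot_mean:.2f}x")
print(f"top 1.6% of users: {top_share(production, 0.016):.0%} of tokens")
```

The median user is unchanged, yet the per-user bill is close to three times the forecast, and about two thirds of it comes from 1.6% of users. A forecast built on percentiles of a production-like distribution would have surfaced that; the pilot mean could not.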

The Interrupt UI That Taught Your Users to Never Interrupt the Agent

· 10 min read
Tian Pan
Software Engineer

The interrupt button on your streaming agent has a 0.4% click rate. The product team reads that number and concludes the feature is working as intended — most generations don't need to be interrupted, the implementation is fine, ship it, move on. The actual reading is that the interrupt button taught your users not to press it. Within a week of using the product, they figured out that pressing stop discards the partial response, clears the context, and dumps them back at an empty input box. The lesson they learned is to wait through a bad answer rather than risk losing the thread.

That 0.4% is not a usage signal. It is an aversion signal. Your users are not happy with the answers — they are afraid of the cost of trying to redirect them, and their adaptation is to sit quietly while the agent finishes saying something they already know is wrong. The engineering team treated "stop generation" as a model-call cancellation. The user treated it as "redirect, don't restart." The two definitions never met, and the product shipped a feature that quietly drained user agency from every long-running conversation.

The Latency-Budget Router That Was a Quality-Loss Router by Another Name

· 10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.
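The per-slice view is cheap to compute once eval records carry a slice label. A rough sketch with made-up counts: 980 simple queries whose accuracy does not move, and 20 multi-step queries that fall from 80% to 40% after the latency router ships.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, correct in records:
        totals[slice_name][0] += int(correct)
        totals[slice_name][1] += 1
    return {name: c / n for name, (c, n) in totals.items()}


def overall_accuracy(records):
    return sum(int(correct) for _, correct in records) / len(records)


def make_records(simple_correct, multi_correct):
    # 980 simple and 20 multi-step queries (hypothetical eval mix).
    return (
        [("simple", True)] * simple_correct
        + [("simple", False)] * (980 - simple_correct)
        + [("multi_step", True)] * multi_correct
        + [("multi_step", False)] * (20 - multi_correct)
    )


before = make_records(simple_correct=931, multi_correct=16)
after = make_records(simple_correct=931, multi_correct=8)

print(f"aggregate: {overall_accuracy(before):.3f} -> {overall_accuracy(after):.3f}")
for name in ("simple", "multi_step"):
    b = accuracy_by_slice(before)[name]
    a = accuracy_by_slice(after)[name]
    print(f"{name:>10}: {b:.2f} -> {a:.2f}")
```

In this example the aggregate moves by less than a point while the multi-step slice loses half its accuracy, which is the shape of regression an aggregate dashboard rounds to noise.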