330 posts tagged with "observability"

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

April 27, 2026 · 12 min read

Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.

Reasoning-Effort Budgeting: When Thinking Tokens Become a Finance Line Item

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The first time your finance team asks why a single user racked up a fifty-cent answer to a one-tenth-of-a-cent question, the call will not be about the model. It will be about the line on the invoice that did not exist twelve months ago: reasoning tokens. They look like output tokens on the bill, they bill at output-token rates on most providers, and they have no natural ceiling. A query that would have produced a four-hundred-token reply on a non-reasoning model can quietly burn eight thousand internal thinking tokens to get there — and the only person who notices is the one reconciling the spend.

For most of the API era, "tokens used" was an honest number. You sent a prompt in, you got a response out, and the bill was a clean function of both. Reasoning models broke that intuition. The model now generates a hidden, billable, internally-only-visible chain of thought before it emits the answer the caller will read, and the size of that chain depends on the model's own assessment of how hard the question was. The user-visible output may be a single sentence. The bill may be for ten pages.

The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.

Your Shadow Eval Set Is a Compliance Time-Bomb

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The most dangerous data store in your AI stack is the one nobody designed. It started with a Slack message during a sprint: "Real users are the only thing that catches real bugs — let's tap a percentage of production traffic into the eval pipeline so we can replay it nightly." Six engineers thumbs-upped the message. Nine months later, the bucket holds 4.3 million traces, an eval job pages the on-call when failure rates rise, and the failure cases are emailed verbatim to a Slack channel where forty people can read them. The traces include email addresses, internal company names, partial credit-card digits, employee phone numbers, and customer support transcripts where users explained why they were upset.

Nobody mapped the data flow. No DPIA covered it. The privacy review last quarter looked at the model vendor's API; it didn't look at your eval job. And then a data-subject deletion request arrives, and the team discovers that "delete this user's data everywhere" is a sentence that no longer maps to anything they can actually do.

The AI Feature Metric Trap: Why DAU and Retention Lie About Stochastic Surfaces

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A PM walks into the AI feature review with a slide that reads "+12% engagement, +8% session length, retention up 3 points." The room nods. Two desks over, the support lead is staring at a different chart: tickets touching the AI surface are up 22%, and the most common resolution code is "user gave up, agent helped manually." Both numbers are real. Both come from the same product. The PM's dashboard is built on the assumption that the AI feature emits the same shape of event as the button it replaced. It doesn't. And the gap between what the dashboard counts and what the user experienced is where AI features quietly fail in plain sight.

The deterministic-feature playbook treats interaction as a click stream: user fires an event, the system reacts, the user moves on. AI features have a different event shape — a task arc with phases, retries, side trips to a human, and an offline judgment the telemetry never sees. Importing the deterministic dashboard onto that arc is the analytics equivalent of running 2018's interview loop against 2026's job. The numbers go up. The thing the numbers were supposed to predict goes down.

Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

April 27, 2026 · 12 min read

Tian Pan

Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.

Trust Ceilings: The Autonomy Variable Your Product Team Can't See

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

Every agentic feature has a maximum autonomy level above which users start checking work, intervening, or abandoning the feature entirely. That maximum is not a property of your model. It is a property of your users, your domain, and the cost of being wrong, and it does not move because a launch deck says it should. Most teams discover their ceiling the hard way: a feature ships designed for full autonomy, adoption stalls at "agent suggests, human approves," the metrics blame the model, and the next quarter is spent tuning a knob that was never the bottleneck.

The shape of the ceiling is consistent enough across products that it deserves a name. Anthropic's own usage data on Claude Code shows new users using full auto-approve about 20% of the time, climbing past 40% only after roughly 750 sessions. PwC's 2025 survey of 300 senior executives found 79% of companies are using AI agents, but most production deployments operate at "collaborator" or "consultant" levels — the model proposes, the human disposes — not at the fully autonomous tier the marketing implied. The story underneath those numbers is not that users are timid. It is that trust is calibrated to the cost of a recoverable mistake, and your product almost certainly does not let users see, undo, or bound that cost the way they need to.

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.

Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

April 26, 2026 · 12 min read

Tian Pan

Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.

Your Model Update Is a Breaking Change: The Behavioral Changelog You Owe Your Integrators

April 26, 2026 · 12 min read

Tian Pan

Software Engineer

A vendor pushes a "minor refresh" to a model alias on a Tuesday afternoon. By Thursday, four customer companies are running incident response. None of them deployed code that week. None of their dashboards show a regression in latency, error rate, or any other infra-shaped metric. What changed is that the model behind their pinned alias started returning slightly different sentences, slightly different JSON, and slightly different refusals — and every prompt their team wrote against the old behavior is now a contract that nobody honored.

The asymmetry is the entire story. The provider treated the rollout as a deploy: tested internally, gated on a few aggregate evals, ramped to 100% within a maintenance window. The consumer surface received it as a semver violation: a dependency upgraded itself in production without changing its version string, and the bug reports started rolling in from end users with the cheerful subject line "nothing changed on our side."

The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request

April 26, 2026 · 10 min read

Tian Pan

Software Engineer

Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.

The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.

The Contestability Gap: Engineering AI Decisions Your Users Can Actually Appeal

April 26, 2026 · 11 min read

Tian Pan

Software Engineer

A user opens a chat, asks for a refund, gets "I'm sorry, this purchase is not eligible for a refund," closes the tab, and never comes back. Internally, the agent emitted a beautiful trace: tool calls, intermediate reasoning, the policy bundle it consulted, the model version it ran on. Every span landed in the observability platform. None of it landed anywhere the user could reach. There is no button labeled "ask a human to look at this again," and even if there were, there is no service behind it. The decision is final by default, not by design.

This is the contestability gap, and it is the next thing regulators, lawyers, and angry users are going to rip open. It is also one of the cleanest examples of a problem that looks like policy from the outside and turns out to be plumbing on the inside.

About Tian Pan