
183 posts tagged with "observability"


Your Shadow Eval Set Is a Compliance Time-Bomb

· 10 min read
Tian Pan
Software Engineer

The most dangerous data store in your AI stack is the one nobody designed. It started with a Slack message during a sprint: "Real users are the only thing that catches real bugs — let's tap a percentage of production traffic into the eval pipeline so we can replay it nightly." Six engineers thumbs-upped the message. Nine months later, the bucket holds 4.3 million traces, an eval job pages the on-call when failure rates rise, and the failure cases are posted verbatim to a Slack channel where forty people can read them. The traces include email addresses, internal company names, partial credit-card digits, employee phone numbers, and customer support transcripts where users explained why they were upset.

Nobody mapped the data flow. No DPIA covered it. The privacy review last quarter looked at the model vendor's API; it didn't look at your eval job. And then a data-subject deletion request arrives, and the team discovers that "delete this user's data everywhere" is a sentence that no longer maps to anything they can actually do.
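What "maps to something you can actually do" looks like in practice is unglamorous: key every sampled trace to a stable subject identifier at ingest time, so a deletion request becomes a lookup instead of an archaeology project across 4.3 million objects. A minimal sketch of that index; the class, field names, and storage choice are illustrative assumptions, not anything the post prescribes.

```python
import hashlib
from collections import defaultdict

def subject_key(user_id: str, salt: str = "eval-pipeline") -> str:
    """Stable pseudonymous key so the index itself holds no raw identifier."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()

class EvalTraceIndex:
    """Maps each data subject to every eval artifact derived from their traffic."""

    def __init__(self):
        self._objects = defaultdict(set)   # subject key -> object paths in the eval bucket

    def record(self, user_id: str, object_path: str) -> None:
        # Called at sampling time, when the trace is copied into the eval store.
        self._objects[subject_key(user_id)].add(object_path)

    def objects_for(self, user_id: str) -> set[str]:
        # A deletion request becomes a lookup, not a scan.
        return set(self._objects.get(subject_key(user_id), set()))

    def forget(self, user_id: str) -> int:
        # Caller deletes the listed objects from storage, then drops the index entry.
        return len(self._objects.pop(subject_key(user_id), set()))
```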

The AI Feature Metric Trap: Why DAU and Retention Lie About Stochastic Surfaces

· 11 min read
Tian Pan
Software Engineer

A PM walks into the AI feature review with a slide that reads "+12% engagement, +8% session length, retention up 3 points." The room nods. Two desks over, the support lead is staring at a different chart: tickets touching the AI surface are up 22%, and the most common resolution code is "user gave up, agent helped manually." Both numbers are real. Both come from the same product. The PM's dashboard is built on the assumption that the AI feature emits the same shape of event as the button it replaced. It doesn't. And the gap between what the dashboard counts and what the user experienced is where AI features quietly fail in plain sight.

The deterministic-feature playbook treats interaction as a click stream: user fires an event, the system reacts, the user moves on. AI features have a different event shape — a task arc with phases, retries, side trips to a human, and an offline judgment the telemetry never sees. Importing the deterministic dashboard onto that arc is the analytics equivalent of running 2018's interview loop against 2026's job. The numbers go up. The thing the numbers were supposed to predict goes down.
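One way to make the task arc concrete is to stop counting isolated clicks and start logging phase transitions against a task ID, so the funnel can tell "completed" apart from "gave up and a human finished it." The phase names and outcome labels below are my assumption about what such a schema might contain, not a schema from the post.

```python
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    STARTED = "started"
    MODEL_RESPONDED = "model_responded"
    USER_RETRIED = "user_retried"
    ESCALATED_TO_HUMAN = "escalated_to_human"
    ABANDONED = "abandoned"
    CONFIRMED_DONE = "confirmed_done"   # explicit user signal, not just "no error"

@dataclass
class TaskEvent:
    task_id: str
    phase: Phase
    ts_ms: int

def outcome(events: list[TaskEvent]) -> str:
    """Classify one task arc; a DAU-style counter would score all of these as 'engaged'."""
    phases = {e.phase for e in events}
    if Phase.CONFIRMED_DONE in phases:
        return "succeeded"
    if Phase.ESCALATED_TO_HUMAN in phases:
        return "handed_off"
    if Phase.ABANDONED in phases or Phase.USER_RETRIED in phases:
        return "struggled_or_gave_up"
    return "unknown"
```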

Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

· 12 min read
Tian Pan
Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.
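A triage-oriented taxonomy has to be derived, not read off the API: combine the provider's stop_reason with signals from the output artifact itself. A sketch of that derivation; the heuristics (unclosed code fences, truncated JSON) and label names are illustrative, not any provider's actual taxonomy.

```python
import json

FENCE = "`" * 3  # triple-backtick code fence marker

def triage_stop(stop_reason: str, text: str, max_tokens: int, tokens_used: int) -> str:
    """Map a provider stop_reason plus output-shape signals onto a triage label."""
    if stop_reason == "max_tokens" or tokens_used >= max_tokens:
        return "truncated_by_budget"
    if stop_reason in ("content_filter", "refusal"):
        return "safety_intervened"
    if stop_reason == "tool_use":
        return "paused_for_tool"
    # end_turn is where the lies live: the model chose to stop, but the artifact
    # it produced may still be structurally incomplete.
    if text.count(FENCE) % 2 == 1:
        return "ended_mid_code_block"
    stripped = text.rstrip()
    if stripped.startswith("{") and not _is_valid_json(stripped):
        return "ended_mid_json"
    return "clean_end_turn"

def _is_valid_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except ValueError:
        return False
```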

Trust Ceilings: The Autonomy Variable Your Product Team Can't See

· 10 min read
Tian Pan
Software Engineer

Every agentic feature has a maximum autonomy level above which users start checking work, intervening, or abandoning the feature entirely. That maximum is not a property of your model. It is a property of your users, your domain, and the cost of being wrong, and it does not move because a launch deck says it should. Most teams discover their ceiling the hard way: a feature ships designed for full autonomy, adoption stalls at "agent suggests, human approves," the metrics blame the model, and the next quarter is spent tuning a knob that was never the bottleneck.

The shape of the ceiling is consistent enough across products that it deserves a name. Anthropic's own usage data on Claude Code shows new users using full auto-approve about 20% of the time, climbing past 40% only after roughly 750 sessions. PwC's 2025 survey of 300 senior executives found 79% of companies are using AI agents, but most production deployments operate at "collaborator" or "consultant" levels — the model proposes, the human disposes — not at the fully autonomous tier the marketing implied. The story underneath those numbers is not that users are timid. It is that trust is calibrated to the cost of a recoverable mistake, and your product almost certainly does not let users see, undo, or bound that cost the way they need to.
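If the ceiling is set by the cost of a recoverable mistake, the product lever is to gate autonomy on reversibility rather than on model confidence. A sketch of what that gate could look like; the action metadata and the session threshold are invented for illustration, not figures from the surveys above.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    reversible: bool       # can the user undo it with one click?
    blast_radius: int      # rough count of people or records affected

def requires_approval(action: Action, auto_approve_sessions: int) -> bool:
    """Autonomy rises with demonstrated use, but never past what the user can undo."""
    if not action.reversible:
        return True                          # irreversible work always gets a human
    if action.blast_radius > 1:
        return auto_approve_sessions < 100   # illustrative ramp, not a benchmark
    return False
```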

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

· 11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, just under nine hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.
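The compounding is easy to check yourself. If a task needs N essential calls and each one independently succeeds 99.9% of the time, task-level success is 0.999 raised to the Nth power; the snippet below just evaluates that, and the independence assumption is mine, not the provider's.

```python
# Per-call availability the SLA promises vs. what a fan-out task experiences.
per_call = 0.999

for calls in (1, 8, 12, 20):
    task_success = per_call ** calls
    print(f"{calls:>2} essential calls -> {task_success:.3%} task-level success")

# At 20 essential calls the task-level number lands around 98.0% -- roughly the
# "around 98%" on the dashboard -- while every individual call is still
# comfortably inside the 99.9% SLA.
```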

Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

· 12 min read
Tian Pan
Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.
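The deliberate version of that renegotiation usually starts on the caller's side: bound the agent's own parallelism against any one downstream service instead of letting eight planners race straight into a 60-requests-per-minute limit. A sketch using asyncio; the service, the limit, and the call stub are illustrative, and a real client would layer per-minute pacing on top of the concurrency cap shown here.

```python
import asyncio

# One gate per downstream service, shared by every planning branch the agent spawns,
# so the agent degrades to "slower" instead of "silently 429'd".
_calendar_gate = asyncio.Semaphore(4)   # at most four calendar calls in flight

async def call_calendar_api(payload: dict) -> dict:
    async with _calendar_gate:
        await asyncio.sleep(0.1)         # stand-in for the real HTTP call
        return {"ok": True, "echo": payload}

async def plan_week(user_id: str) -> list[dict]:
    # Eight parallel planning branches, but the gate bounds what the backend sees.
    branches = [call_calendar_api({"user": user_id, "branch": i}) for i in range(8)]
    return await asyncio.gather(*branches)

if __name__ == "__main__":
    print(asyncio.run(plan_week("u_123")))
```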

Your Model Update Is a Breaking Change: The Behavioral Changelog You Owe Your Integrators

· 12 min read
Tian Pan
Software Engineer

A vendor pushes a "minor refresh" to a model alias on a Tuesday afternoon. By Thursday, four customer companies are running incident response. None of them deployed code that week. None of their dashboards show a regression in latency, error rate, or any other infra-shaped metric. What changed is that the model behind their pinned alias started returning slightly different sentences, slightly different JSON, and slightly different refusals — and every prompt their team wrote against the old behavior is now a contract that nobody honored.

The asymmetry is the entire story. The provider treated the rollout as a deploy: tested internally, gated on a few aggregate evals, ramped to 100% within a maintenance window. The consumer surface received it as a semver violation: a dependency upgraded itself in production without changing its version string, and the bug reports started rolling in from end users with the cheerful subject line "nothing changed on our side."
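On the consumer side, the defensible posture is to treat the alias like an unpinned dependency: keep a small set of golden prompts and diff today's outputs against the recorded ones before trusting the alias. A sketch of that check; call_model is a placeholder for whatever client you use, the golden-file format is an assumption, and the exact-match fingerprint assumes deterministic sampling — fuzzier outputs need a semantic comparison instead of a hash.

```python
import hashlib
import json
from pathlib import Path

GOLDEN = Path("golden_outputs.json")   # prompt -> fingerprint of the accepted output

def fingerprint(output: str) -> str:
    # Hash a whitespace-normalized form so formatting-only drift doesn't page anyone.
    return hashlib.sha256(" ".join(output.split()).encode()).hexdigest()

def behavioral_diff(call_model, prompts: list[str]) -> list[str]:
    """Return the prompts whose outputs no longer match the recorded behavior."""
    golden = json.loads(GOLDEN.read_text()) if GOLDEN.exists() else {}
    drifted = []
    for prompt in prompts:
        fp = fingerprint(call_model(prompt))
        if golden.get(prompt) not in (None, fp):
            drifted.append(prompt)
    return drifted
```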

The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request

· 10 min read
Tian Pan
Software Engineer

Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.

The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.
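Mechanically, the missing column is just another span attribute; the hard part is the estimate behind it. The sketch below attaches an energy figure derived from token counts and an assumed per-token energy coefficient. Every number in it (joules per token, grid intensity) is a placeholder to replace with measured or vendor-disclosed figures, and it assumes the opentelemetry-api package.

```python
from opentelemetry import trace   # assumes opentelemetry-api is installed

tracer = trace.get_tracer("inference")

# Placeholder coefficients -- substitute measured or vendor-disclosed values.
JOULES_PER_OUTPUT_TOKEN = 2.0      # illustrative, not a published constant
GRID_KG_CO2E_PER_KWH = 0.4         # illustrative grid intensity

def record_inference(prompt_tokens: int, output_tokens: int) -> None:
    with tracer.start_as_current_span("llm.inference") as span:
        energy_kwh = (output_tokens * JOULES_PER_OUTPUT_TOKEN) / 3.6e6
        span.set_attribute("gen_ai.usage.input_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        # The two columns the cost dashboard never had:
        span.set_attribute("llm.energy_kwh_estimate", energy_kwh)
        span.set_attribute("llm.kg_co2e_estimate", energy_kwh * GRID_KG_CO2E_PER_KWH)
```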

The Contestability Gap: Engineering AI Decisions Your Users Can Actually Appeal

· 11 min read
Tian Pan
Software Engineer

A user opens a chat, asks for a refund, gets "I'm sorry, this purchase is not eligible for a refund," closes the tab, and never comes back. Internally, the agent emitted a beautiful trace: tool calls, intermediate reasoning, the policy bundle it consulted, the model version it ran on. Every span landed in the observability platform. None of it landed anywhere the user could reach. There is no button labeled "ask a human to look at this again," and even if there were, there is no service behind it. The decision is final by default, not by design.

This is the contestability gap, and it is the next thing regulators, lawyers, and angry users are going to rip open. It is also one of the cleanest examples of a problem that looks like policy from the outside and turns out to be plumbing on the inside.
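The plumbing version of contestability is small: every adverse, user-facing decision gets a durable record with an appeal handle, and the handle routes to a queue a human actually works. A sketch of what that record might hold; the fields and the in-memory queue are illustrative stand-ins for real storage and routing.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    user_id: str
    decision: str              # e.g. "refund_denied"
    trace_id: str              # links back to the spans you already emit
    policy_version: str
    appeal_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    appealed: bool = False

# The part the user can reach: the appeal_id is shown in the chat, and filing it
# enqueues the case for human review instead of letting the decision be final by default.
_appeal_queue: list[DecisionRecord] = []

def file_appeal(record: DecisionRecord, reason: str) -> str:
    record.appealed = True
    _appeal_queue.append(record)
    return f"Appeal {record.appeal_id} filed: now in a human's queue ({reason!r})."
```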

The Dual-Writer Race: When Your Agent and Your User Edit the Same Calendar Event

· 12 min read
Tian Pan
Software Engineer

The agent confidently reports "I rescheduled the meeting to Thursday at 3pm." The user is staring at the original Tuesday 10am slot, because between the agent's plan and its commit they edited the event themselves. Last-write-wins overwrote a human change with an automated one, and the user's trust in the assistant collapses on a single incident. This is the dual-writer race, and it is the bug class that agent toolchains were never designed for.

Most agent platforms inherit this category by accident. The tool layer treats update_event as a single function call: take an ID, take new fields, return success. The provider API underneath has offered optimistic concurrency primitives — ETags, version tokens, If-Match preconditions — for a decade, and almost nobody plumbs them through. The model has no way to know that the world it reasoned about a minute ago is no longer current, because the abstraction it was given silently throws that information away.
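Plumbing the precondition through is mostly a matter of not discarding it: the tool reads the event, keeps the version token, and sends it back as If-Match so the write fails loudly if anyone (including the user) edited the event in between. A sketch using the requests library against a generic ETag-style calendar API; the endpoint shape is an assumption.

```python
import requests

def update_event_safely(base_url: str, event_id: str, new_fields: dict, token: str) -> dict:
    """Read-modify-write with an If-Match precondition instead of last-write-wins."""
    headers = {"Authorization": f"Bearer {token}"}

    current = requests.get(f"{base_url}/events/{event_id}", headers=headers)
    current.raise_for_status()
    etag = current.headers["ETag"]   # the version of the world the model reasoned about

    resp = requests.patch(
        f"{base_url}/events/{event_id}",
        json=new_fields,
        headers={**headers, "If-Match": etag},
    )
    if resp.status_code == 412:      # Precondition Failed: someone else wrote first
        latest = requests.get(f"{base_url}/events/{event_id}", headers=headers).json()
        # Surface this to the agent as "the event changed, re-plan", never as "success".
        return {"status": "conflict", "event": latest}
    resp.raise_for_status()
    return {"status": "updated", "event": resp.json()}
```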

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.
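The fix is boring: record, per request, every hop you can observe at the layers you do own, and demand the rest contractually. A sketch of what a per-request routing record might hold; the field names and the distinction between claimed and observed regions are illustrative.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Hop:
    layer: str             # "gateway", "provider_endpoint", "failover", "cache"
    region_claimed: str    # what the config or contract says
    region_observed: str   # what you could actually verify, or "unknown"
    ts: float

def routing_record(request_id: str, hops: list[Hop]) -> str:
    """One log line per request that answers 'where did this actually go'."""
    return json.dumps({"request_id": request_id, "hops": [asdict(h) for h in hops]})

# The incident above: EU gateway, US primary, Singapore failover.
print(routing_record("req-1432Z", [
    Hop("gateway", "eu-west-1", "eu-west-1", time.time()),
    Hop("provider_endpoint", "us-east-1", "unknown", time.time()),
    Hop("failover", "ap-southeast-1", "unknown", time.time()),
]))
```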

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.
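The lightest-weight fix is to treat span names as shared constants with a test that names the consumers, so a rename fails in CI on the producer's side instead of in finance's dashboard three days later. A sketch; the span name is the one from the incident above, and the registry-plus-frozen-test pattern is one possible shape, not a standard.

```python
# telemetry_contract.py -- the one place span names may change, with consumers listed.
SPAN_NAMES = {
    "LLM_SYNTHESIS": "llm.completion.synthesis",  # consumers: finance token-spend dashboard,
                                                  # eval pipeline filter, latency SLO alert
}

def span_name(key: str) -> str:
    return SPAN_NAMES[key]


# test_telemetry_contract.py -- a rename now requires touching a test that names the consumers.
def test_span_names_are_frozen():
    frozen = {"LLM_SYNTHESIS": "llm.completion.synthesis"}
    assert SPAN_NAMES == frozen, (
        "Span names are a public schema: coordinate a rename with every consumer "
        "before updating this test."
    )
```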