Skip to main content

299 posts tagged with "observability"

View all tags

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

· 11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.

Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

· 12 min read
Tian Pan
Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.

Your Model Update Is a Breaking Change: The Behavioral Changelog You Owe Your Integrators

· 12 min read
Tian Pan
Software Engineer

A vendor pushes a "minor refresh" to a model alias on a Tuesday afternoon. By Thursday, four customer companies are running incident response. None of them deployed code that week. None of their dashboards show a regression in latency, error rate, or any other infra-shaped metric. What changed is that the model behind their pinned alias started returning slightly different sentences, slightly different JSON, and slightly different refusals — and every prompt their team wrote against the old behavior is now a contract that nobody honored.

The asymmetry is the entire story. The provider treated the rollout as a deploy: tested internally, gated on a few aggregate evals, ramped to 100% within a maintenance window. The consumer surface received it as a semver violation: a dependency upgraded itself in production without changing its version string, and the bug reports started rolling in from end users with the cheerful subject line "nothing changed on our side."

The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request

· 10 min read
Tian Pan
Software Engineer

Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.

The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.

The Contestability Gap: Engineering AI Decisions Your Users Can Actually Appeal

· 11 min read
Tian Pan
Software Engineer

A user opens a chat, asks for a refund, gets "I'm sorry, this purchase is not eligible for a refund," closes the tab, and never comes back. Internally, the agent emitted a beautiful trace: tool calls, intermediate reasoning, the policy bundle it consulted, the model version it ran on. Every span landed in the observability platform. None of it landed anywhere the user could reach. There is no button labeled "ask a human to look at this again," and even if there were, there is no service behind it. The decision is final by default, not by design.

This is the contestability gap, and it is the next thing regulators, lawyers, and angry users are going to rip open. It is also one of the cleanest examples of a problem that looks like policy from the outside and turns out to be plumbing on the inside.

The Dual-Writer Race: When Your Agent and Your User Edit the Same Calendar Event

· 12 min read
Tian Pan
Software Engineer

The agent confidently reports "I rescheduled the meeting to Thursday at 3pm." The user is staring at the original Tuesday 10am slot, because between the agent's plan and its commit they edited the event themselves. Last-write-wins overwrote a human change with an automated one, and the user's trust in the assistant collapses on a single incident. This is the dual-writer race, and it is the bug class that agent toolchains were never designed for.

Most agent platforms inherit this category by accident. The tool layer treats update_event as a single function call: take an ID, take new fields, return success. The provider API underneath has offered optimistic concurrency primitives — ETags, version tokens, If-Match preconditions — for a decade, and almost nobody plumbs them through. The model has no way to know that the world it reasoned about a minute ago is no longer current, because the abstraction it was given silently throws that information away.

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.

User-Side Concept Drift: When Your Prompt Held but Your Users Moved

· 10 min read
Tian Pan
Software Engineer

Most teams set up drift monitoring on the wrong side of the contract. They watch the model — capability shifts when a vendor pushes a new checkpoint, output distribution changes after a prompt rewrite, refusal-rate spikes that signal a safety filter retune. The dashboards are detailed, the alerts are wired into PagerDuty, and the team has a runbook for "the model moved." None of that helps when the model didn't move and the dashboard still goes red, because the thing that shifted was your users.

User-side concept drift is the version of this problem that almost every eval pipeline misses. Your prompt, your model, and your tools are byte-identical to the day you launched. Your golden test set still passes 91%. But the prompt that hit 91% in week one is now serving 78% in week thirty, because the input distribution has moved underneath it — users learned the product and changed how they ask, vocabulary mutated, seasonal task types appeared, a competitor reframed the category, a viral thread taught a new way to phrase the same intent. The model and prompt held. The contract held. The world the contract was negotiated against did not.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

You Accidentally Built a Feature-Flag System for Prompts — Without the Governance

· 10 min read
Tian Pan
Software Engineer

Pull up the config repo your team uses to ship prompt changes. Look at the last thirty commits. How many had a code review? How many had an eval gate in CI? How many can you attribute — with certainty — to a measurable change in production behavior for the users who saw them? If your answer is "most," you are an outlier. For everyone else, those commits are running in production right now, and the system reading them is doing exactly what a feature-flag service does: hot-reload a value, fan it out to users, change product behavior. The difference is that your feature-flag service has audit logs, exposure tracking, kill switches, and per-cohort targeting. Your prompt deploy pipeline has git push.

This is not a metaphor. It is an accurate description of the production system your team is running. The prompt config repo, the S3 bucket your workers poll, the "prompts" collection in your database, the LangSmith/PromptLayer/Braintrust asset that your app fetches on boot — these are all feature-flag services. They have the same runtime shape: a value lives outside the binary, the binary reads it on a hot path, changing the value changes behavior for real users without a deploy. The only thing missing is every control your SRE team demanded before they would approve the actual feature-flag service.

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.