330 posts tagged with "observability"

The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure

April 27, 2026 · 10 min read

Software Engineer

A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.

An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.

That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.

Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.

This is the failure mode nobody covered in the agent demo. Cancellation in synchronous code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio.cancel, structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.

The Chargeback Ledger for Compound AI Systems

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The first time the CFO asks "what does the assistant cost us per month," the engineering team produces a number. The second time, a different team produces a different number. The third time, finance produces a third number, and somebody opens a spreadsheet that re-derives the bill from spans because nobody trusts any of the previous answers. This is the moment a compound AI system stops being an architecture problem and becomes an accounting problem.

The shape of the failure is structural. A single user request to "summarize my last quarter's customer feedback" triggers an agent owned by team A, which calls a retrieval tool maintained by team B, which calls a model hosted by provider X, which streams results back through a re-ranking tool from team C, which calls a different model from provider Y. One click; five owners; two invoices that arrive a month apart. Standard FinOps primitives — cost centers, allocation tags, account-level rollups — were designed to slice infrastructure that already had stable owners. They do not compose cleanly across an internal call graph that crosses team boundaries on every request.

The 2026 State of FinOps report puts 98% of FinOps teams on the hook for AI spend, and the same survey lists real-time visibility into AI costs as the top tooling gap. That gap is not "we cannot see the bill." The gap is "we cannot see who caused what slice of the bill, fast enough that anyone changes their behavior before the bill arrives."

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.

Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model

April 27, 2026 · 13 min read

Tian Pan

Software Engineer

Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.

This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.

Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.

Found Capabilities: When Users Ship Features Your Team Never Roadmapped

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

April 27, 2026 · 12 min read

Tian Pan

Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.

Prompt-Version Skew Across Regions: The Unintended A/B Test Your CDN Ran for Six Hours

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

You shipped a system-prompt change at 09:14. The rollout dashboard turned green at 09:31. By 11:00 your eval tracker still looked clean, the cost dashboard was unremarkable, and a customer-success engineer pinged the team: structured-output errors on the parser side were up about three percent in Asia-Pacific only. Nothing in North America. Nothing in Europe.

The rollout had paused itself at 67% region coverage because a non-load-bearing health check on one POP flapped during the cutover, and nobody had noticed. For six hours, us-east and eu-west were running prompt v47 while ap-south and ap-northeast were still on v46. You were running a live A/B test split by geography — except you didn't design the test, you couldn't see the test, and the eval suite that was supposed to catch quality regressions was hitting the new version in one region and shrugging.

This failure mode is not a bug in any single tool. It is the predictable consequence of pushing prompts through deployment systems built for a different kind of artifact.

About Tian Pan