299 posts tagged with "observability"

Capacity Planning When Every Request Thinks a Different Amount

· 10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by more than three orders of magnitude (500 tokens against 1.2 million), and nothing in the request told you which one you were about to get.

The Carbon Line Item Nobody Puts in the AI Feature Spec

· 10 min read
Tian Pan
Software Engineer

Open any AI feature review and you will hear the same three numbers debated: latency, token cost, and accuracy. Someone pulls up the p95 chart, someone else does the math on cost-per-thousand-requests, and a third person argues the eval score is good enough to ship. Nobody mentions energy. Nobody mentions carbon. And because nobody mentions it, the environmental footprint of the feature still gets decided — implicitly, by whoever wins the argument about the dollar figure.

That is the quiet problem with AI sustainability. It is not that teams choose a high-carbon design on purpose. It is that they never choose at all. The footprint is a side effect of a cost decision, and cost only loosely tracks carbon. A routing rule that looks like a clean win on the spend dashboard can quietly double emissions, and no one in the room would know, because the number that would have told them was never on a dashboard.

This post treats energy and carbon as what they actually are: a measurable, ownable property of an AI system, on the same footing as latency and cost. Not a corporate-values footnote. A line item.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.
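One way to keep the two worlds apart is to retry only faults that look transient and to surface caller faults the first time they appear. The sketch below is a minimal illustration, not a framework API: `TransientError`, `CallerError`, `RetryOutcome`, and `call_with_retry` are hypothetical names, and it assumes your tool layer can already tell a timeout from a malformed argument.

```python
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical: the network or dependency failed (timeout, 503)."""


class CallerError(Exception):
    """Hypothetical: the agent sent a bad call (schema violation, 400)."""


@dataclass
class RetryOutcome:
    result: object = None
    succeeded: bool = False
    transient_failures: int = 0
    caller_failures: list = field(default_factory=list)


def call_with_retry(tool, args, max_attempts=3):
    """Retry transient faults only; stop and record caller faults."""
    outcome = RetryOutcome()
    for _ in range(max_attempts):
        try:
            outcome.result = tool(args)
            outcome.succeeded = True
            return outcome
        except TransientError:
            # A real policy would sleep with backoff here; omitted for brevity.
            outcome.transient_failures += 1
        except CallerError as exc:
            # Re-sampling the agent would hide this, so keep it and stop.
            outcome.caller_failures.append(str(exc))
            return outcome
    return outcome
```

The point of returning `RetryOutcome` rather than a bare result is that the failed attempts stay attached to the call, so a dashboard can count them even when the final attempt succeeds.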

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.
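A crude guard against this is to check that every figure in the answer also appears in some tool output from the same run. The sketch below is only a heuristic under stated assumptions: `unsupported_numbers` is a hypothetical helper, tool outputs are assumed to be available as strings, and exact string matching will miss rounded or derived values.

```python
import re

# Matches integers and simple decimals such as 42 or 6.8.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer, tool_outputs):
    """Return numbers in the answer that no tool output contains."""
    seen = set()
    for output in tool_outputs:
        seen.update(NUMBER.findall(output))
    return [n for n in NUMBER.findall(answer) if n not in seen]
```

An empty list does not prove the figure is right, only that it came from somewhere; a non-empty list is a cheap signal that the agent may have narrated a number instead of computing it.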

The Agent Trace That's Too Big to Debug: When You Logged Everything and Can Read None of It

· 11 min read
Tian Pan
Software Engineer

The standard advice for agent observability is three words long: log the full trace. Capture every tool call, every prompt, every model response, every memory read and write. Teams comply. Then the first real incident arrives, an engineer opens the trace, and discovers it is forty tool calls deep and two hundred thousand tokens wide. The trace is technically complete. It is also practically unreadable.

What follows is a familiar ritual. The engineer scrolls. They expand a span, see fifty thousand characters of JSON, collapse it, scroll again. Ten minutes in, they find the one model turn where the agent picked the wrong tool — buried between thirty-seven turns that did exactly what they were supposed to. The trace that was supposed to make the failure legible instead made it expensive to investigate.

The Approval Queue Nobody Drains

· 10 min read
Tian Pan
Software Engineer

You did the responsible thing. You looked at your agent, identified the actions that could cause real damage — issuing a refund, deleting a record, sending an external email, deploying a config change — and you routed them to a human for approval. Risk-tiered gating. Textbook. The review board signed off.

Then a customer escalation came in three weeks later: an agent task had been "in progress" since the previous Tuesday. Not failed. Not errored. Just sitting in a human approval queue that, it turned out, nobody was actually watching. The agent had done its job, parked the dangerous action behind a gate, and waited. The gate had no owner. The task aged silently in a place where no dashboard pointed and no alarm fired.
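A gate with no owner and no age limit can be caught by a very small check. The following is a sketch under assumptions, not a description of any particular queueing product: `PendingApproval`, `stale_approvals`, and the four-hour default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PendingApproval:
    task_id: str
    action: str
    enqueued_at: datetime
    owner: Optional[str] = None  # None means nobody is responsible for the gate


def stale_approvals(queue, now, max_age=timedelta(hours=4)):
    """Return task ids that are unowned or have waited longer than max_age."""
    return [
        item.task_id
        for item in queue
        if item.owner is None or now - item.enqueued_at > max_age
    ]
```

Run on a schedule and wired to a pager or a channel, a check like this turns "in progress since last Tuesday" into an alert within hours rather than a customer escalation weeks later.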

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
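Part of that missing plumbing can live in the tool wrapper rather than the model. The sketch below compares the most recent ten call latencies against the ten before them, mirroring the example above; `LatencyDriftMonitor`, the window of ten, and the 2x threshold are illustrative assumptions, and it compares window means rather than individual calls.

```python
from collections import deque
from statistics import mean


class LatencyDriftMonitor:
    """Flags a dependency whose recent calls are much slower than earlier ones."""

    def __init__(self, window=10, ratio=2.0):
        self.window = window
        self.ratio = ratio
        # Holds two consecutive windows; older samples fall off the left.
        self.samples = deque(maxlen=2 * window)

    def record(self, latency_s):
        self.samples.append(latency_s)

    def degraded(self):
        # Not enough history yet to compare two full windows.
        if len(self.samples) < 2 * self.window:
            return False
        samples = list(self.samples)
        earlier = mean(samples[: self.window])
        recent = mean(samples[self.window:])
        return recent >= self.ratio * earlier
```

The result of `degraded()` can then be handed to the agent as an explicit observation, for example a line in its context saying the dependency is slow, instead of leaving it to discover the problem by failing.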

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.

The Incident Ticket With No Repro Steps: Reproducibility as Something You Engineer

· 10 min read
Tian Pan
Software Engineer

The incident ticket is specific in the way only real incidents are. At 02:14 the support agent closed a customer account that should have been put on a 30-day grace period. The customer noticed. The ticket lands on your desk with a single line under "Steps to reproduce": unknown.

You open the trace. You can see the agent called close_account instead of set_grace_period. You can see the tool succeeded. What you cannot see is why the model chose that branch — and when you replay the same customer message through the same agent, it does the right thing. Twice. The postmortem now has a paragraph-shaped hole where the root cause should be, and the only honest thing you can write is "could not reproduce."

The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.
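Versioning the judge can start with storing its identity next to every score and refusing to compare across identities. This is a minimal sketch: `JudgeScore`, `mean_delta`, and the field names are hypothetical, and it assumes you record a pinned model identifier rather than a floating alias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    score: float
    judge_model: str     # pinned identifier, assumed never to be a -latest alias
    rubric_version: str


def mean_delta(old, new):
    """Return mean(new) - mean(old), refusing to mix judge configurations."""
    judges = {(s.judge_model, s.rubric_version) for s in old + new}
    if len(judges) != 1:
        raise ValueError(
            f"scores come from {len(judges)} judge configurations; not comparable"
        )

    def avg(scores):
        return sum(s.score for s in scores) / len(scores)

    return avg(new) - avg(old)
```

When the judge does have to change, the same record makes the migration explicit: re-score a shared sample under both judges and report the offset, rather than letting the trendline absorb it silently.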

Task Completion Goes Green While Users Quietly Suffer

· 8 min read
Tian Pan
Software Engineer

Your agent dashboard says 94% task completion. Leadership is happy. The roadmap gets funded. And yet support tickets are climbing, power users have gone quiet, and the one engineer who actually watches traces keeps muttering that something is wrong. Both things are true at once. The agent is completing tasks. It is also taking twelve minutes and four thousand tokens to do a two-step job, backtracking three times, and asking the user to confirm a fact it could have inferred from the first message.

Task completion is a binary that hides a distribution. "The agent finished" tells you nothing about the path it took to finish, and the path is most of what users actually experience. A completion-rate dashboard is structurally incapable of seeing a slow, expensive, annoying agent. It will stay green right up until users churn.

This is not a measurement gap you can patch with a better prompt. It is a category error in what you chose to measure. Completion is the easiest thing to instrument and the least of what people are paying for.
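To make the distribution visible, report path metrics beside the completion rate. The sketch below is one possible shape, with hypothetical names throughout: `TaskRun`, `summarize`, and the choice of fields are assumptions about what your traces record.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRun:
    completed: bool
    duration_s: float
    tokens: int
    backtracks: int
    clarifying_questions: int


def summarize(runs):
    """Report completion rate alongside the path users actually experienced."""
    done = [r for r in runs if r.completed]
    summary = {"completion_rate": len(done) / len(runs)}
    if done:
        summary.update(
            median_duration_s=median([r.duration_s for r in done]),
            max_duration_s=max(r.duration_s for r in done),
            median_tokens=median([r.tokens for r in done]),
            share_with_backtracks=sum(r.backtracks > 0 for r in done) / len(done),
        )
    return summary
```

A run that finishes in twelve minutes with three backtracks counts as a success in `completion_rate` and as an outlier in every other field, which is the distinction a completion-only dashboard cannot draw.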