299 posts tagged with "observability"

The Agent That Scheduled Itself Into the Maintenance Window

· 11 min read
Tian Pan
Software Engineer

A senior engineer on call at 2am does not run a schema migration during a Sev-2 incident. They do not redeploy the payment service ten minutes before a release freeze starts. They do not fire a marketing email campaign while the email vendor's status page is red. None of this is in their job description. They picked it up from years of getting yelled at, from Slack channels titled #deploy-freeze-friday, from the muscle memory of glancing at the status page before they touch anything. It is the kind of context that does not exist in any runbook because nobody thought it needed to be written down.

Now hand the same job to an agent. The agent has tools. The agent has a multi-step plan. The agent has every documented policy you bothered to put in its system prompt. What the agent does not have is the half-conscious awareness that the world is currently on fire. So it executes the plan. Cleanly. Confidently. Into the maintenance window. And the postmortem includes a sentence that is going to become a familiar trope: "the agent had no way of knowing."
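One way to give the agent that missing awareness is a pre-flight check that runs before any risky tool call. The sketch below is a minimal illustration, not a real scheduler: the `Window` shape, the `preflight` function, and the example freeze dates are all hypothetical, and a production version would pull active windows from whatever calendar or status source your team already trusts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Window:
    """A period during which risky actions should not run (hypothetical shape)."""
    name: str
    start: datetime
    end: datetime


def blocking_windows(now: datetime, windows: list[Window]) -> list[str]:
    """Return the names of every window that is active at `now`."""
    return [w.name for w in windows if w.start <= now < w.end]


def preflight(action: str, now: datetime, windows: list[Window]) -> tuple[bool, str]:
    """Decide whether `action` may run; the reason string is returned to the agent."""
    active = blocking_windows(now, windows)
    if active:
        return False, f"{action} blocked: active window(s): {', '.join(active)}"
    return True, f"{action} allowed: no active windows"


# Example data is made up for illustration.
freeze = Window(
    "release-freeze",
    datetime(2025, 1, 10, 17, 0, tzinfo=timezone.utc),
    datetime(2025, 1, 13, 9, 0, tzinfo=timezone.utc),
)
ok, reason = preflight(
    "schema_migration",
    datetime(2025, 1, 11, 2, 0, tzinfo=timezone.utc),
    [freeze],
)
print(ok, reason)
```

Returning the reason as text matters as much as the boolean: the agent can replan around "blocked by release-freeze" in a way it cannot replan around a bare failure.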

Multimodal Traces: When Modalities Must Share an ID

· 11 min read
Tian Pan
Software Engineer

A user called your support agent. They talked, the agent listened, the user uploaded a screenshot of the error mid-call, the agent reasoned over the image and the transcript, and the conversation ended with a follow-up email summarizing the fix. Three days later the user files a complaint: the fix did not work, and the email never arrived. You open your observability stack and you find three separate traces in three separate UIs. The voice pipeline shows you an ASR trace. The vision pipeline shows a span over the image upload. The LLM call shows a chat trace with a token count and a tool call. Nothing in any of these dashboards tells you they were the same conversation.

This is the postmortem nobody wants to write. Not because the data is missing — every individual modality logged what it was supposed to — but because the join across modalities was never built. Each pipeline grew its own tracing convention from whatever its model vendor shipped by default, and the conversational turn that bound them together exists only in the head of the engineer who designed the agent.
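The missing join can be sketched in a few lines. The example below assumes every pipeline can be made to stamp its spans with one shared `conversation_id` and a turn number; the `Span` shape and field names are illustrative assumptions, not any vendor's tracing schema.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    conversation_id: str  # the shared join key (assumed field name)
    turn: int             # position of this span within the conversation
    modality: str         # e.g. "voice", "vision", "llm"
    name: str
    attrs: dict = field(default_factory=dict)


def new_conversation_id() -> str:
    """Mint one ID at the start of the conversation and pass it to every pipeline."""
    return uuid.uuid4().hex


def group_by_conversation(spans: list[Span]) -> dict[str, list[Span]]:
    """Rebuild one timeline per conversation, ordered by turn."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.conversation_id].append(span)
    for items in grouped.values():
        items.sort(key=lambda s: s.turn)
    return dict(grouped)


cid = new_conversation_id()
spans = [
    Span(cid, 2, "vision", "image_upload"),
    Span(cid, 1, "voice", "asr"),
    Span(cid, 3, "llm", "chat_completion"),
]
timeline = group_by_conversation(spans)[cid]
print([s.modality for s in timeline])
```

The grouping is trivial once the key exists. The hard part is organizational: the ID has to be minted once, upstream of all three pipelines, and carried through each one.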

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.
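A limiter that understands the loop has to count model calls, retries included, against the request that caused them. Here is a minimal sketch under that assumption; `CallBudget`, `call_with_retries`, and the flaky stand-in function are hypothetical names invented for illustration.

```python
class CallBudget:
    """Counts every model call made on behalf of one user request."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def charge(self, n: int = 1) -> None:
        if self.used + n > self.max_calls:
            raise RuntimeError(f"call budget exhausted: {self.used}/{self.max_calls}")
        self.used += n


def call_with_retries(budget, attempt_fn, max_attempts=3):
    """Run attempt_fn until it succeeds; every attempt is charged, retries included."""
    last_error = None
    for _ in range(max_attempts):
        budget.charge()  # charged before the call, so a failed attempt still counts
        try:
            return attempt_fn()
        except Exception as exc:
            last_error = exc
    raise last_error


def flaky_factory(failures):
    """Stand-in for a model call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def attempt():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("transient 429")
        return "ok"

    return attempt


budget = CallBudget(max_calls=5)
result = call_with_retries(budget, flaky_factory(2))
print(result, budget.used)
```

The HTTP-boundary limiter would have seen one request here. The budget saw three calls, which is the number the provider bills.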

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it when asked. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.
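The usual defense is to keep the cacheable part byte-identical and push every per-request field after it. The sketch below assumes a cache that matches on an exact leading prefix, which is a common design but one you should confirm for your provider; the prompt text, field names, and helper functions are made up for illustration.

```python
import hashlib

# Hypothetical stable instructions: nothing per-user or per-day belongs here.
STABLE_PREFIX = "You are a support agent.\nFollow the refund policy."


def build_prompt(stable_prefix: str, volatile: dict) -> str:
    """Keep the cacheable prefix unchanged; append per-request fields after it."""
    tail = "\n".join(f"{key}: {value}" for key, value in sorted(volatile.items()))
    return stable_prefix + "\n---\n" + tail


def prefix_fingerprint(stable_prefix: str) -> str:
    """Short hash to log per deploy, so a changed prefix shows up in review."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:12]


a = build_prompt(STABLE_PREFIX, {"tier": "pro", "date": "2025-01-10"})
b = build_prompt(STABLE_PREFIX, {"tier": "free", "date": "2025-01-11"})
shared = len(STABLE_PREFIX)
print(a[:shared] == b[:shared], prefix_fingerprint(STABLE_PREFIX))
```

Logging the fingerprint gives the drift a commit-sized footprint: the day someone staples a date into the stable section, the hash changes and the change has an owner.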

The Dead Tool Nobody Can Remove From the Registry

· 10 min read
Tian Pan
Software Engineer

A tool has been sitting in your shared agent catalog for fourteen months. It was wired up by an engineer who has since left, for a workflow that was sunset two reorgs ago, against a backend service whose owners are no longer sure who they are. The tool definition is 380 tokens. It ships in every system prompt for every agent in the org, on every turn, because nobody can prove it is unused, and the cost of being wrong about that proof is higher than the cost of carrying it forever.

That tool is the database column nobody dares drop. It is the cron job whose log file rotated out years ago. It is the dead code path you can grep for and find zero references to, except eval() exists and you cannot be sure. The agentic version of this problem is worse, because the carrying cost is not just bytes on disk: it is paid in tokens, in selection accuracy, and in security surface, on every single inference your platform runs.

The Production Logs Your Agent Cannot Read

· 9 min read
Tian Pan
Software Engineer

You wired your incident-response agent into Splunk. You gave it the query syntax in the system prompt, a tool to execute SPL, and a fresh API token. The first time it triaged a real page, it pulled the wrong logs, summarized the wrong service, and confidently named the wrong customer. The integration was perfect. The agent was useless.

Here is what you forgot. Fifteen years of log conventions, undocumented field names, severity strings that drifted from ERR to error to ERROR across three reorgs, and team-specific suffixes that turn customer_id into cust_id_v2_actual on the auth service and tenant.user.id on billing — none of that is in the prompt. You gave the agent access to the API. You did not give it access to the institutional knowledge that makes the API useful.

The shape of this failure is bigger than Splunk. It applies to any agent integration where the tool exposes a query language over a corpus the team has been shaping by hand for a decade. The agent has the verbs. It does not have the nouns.
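Handing the agent the nouns can start as something as plain as a lookup table it consults before writing a query. This sketch uses the field names from the example above; the table layout, the `severity` field name, and both helper functions are assumptions for illustration, and the generated clause is a generic boolean filter rather than verified SPL.

```python
# Canonical name -> per-service field name (values taken from the example above).
FIELD_ALIASES = {
    "customer_id": {"auth": "cust_id_v2_actual", "billing": "tenant.user.id"},
}

# Canonical severity -> every spelling that has appeared in the logs.
SEVERITY_VARIANTS = {"error": ["ERR", "error", "ERROR"]}


def resolve_field(canonical: str, service: str) -> str:
    """Map a canonical field to the name a given service actually logs."""
    return FIELD_ALIASES.get(canonical, {}).get(service, canonical)


def severity_clause(level: str, field: str = "severity") -> str:
    """Build a filter that matches every known spelling of one severity level."""
    variants = SEVERITY_VARIANTS.get(level, [level])
    return "(" + " OR ".join(f'{field}="{v}"' for v in variants) + ")"


print(resolve_field("customer_id", "auth"))
print(resolve_field("customer_id", "billing"))
print(severity_clause("error"))
```

The table is the easy part to write and the hard part to keep true. It only helps if the teams that rename fields are the ones who update it.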

Your Agent's Audit Log Records Everything Except the Reason

· 11 min read
Tian Pan
Software Engineer

Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.

You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.

This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.
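The second half can begin as a structured decision record that the agent must emit alongside its final message. The sketch below is one possible shape, with hypothetical field names and example values. A rationale written by the model is its stated reason, not proof of what actually drove the output, so treat it as an artifact to audit rather than ground truth.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DecisionRecord:
    decision: str       # what the agent did
    rule_applied: str   # which policy rule it says it relied on
    evidence: list      # tool calls or retrieved chunks it cites
    rationale: str      # the stated reasoning, in plain language


def validate(record: DecisionRecord) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if not record.rule_applied.strip():
        problems.append("missing rule_applied")
    if not record.evidence:
        problems.append("no evidence cited")
    if not record.rationale.strip():
        problems.append("missing rationale")
    return problems


# Example values are invented for illustration.
record = DecisionRecord(
    decision="refund_denied",
    rule_applied="return window exceeded",
    evidence=["check_return_window", "lookup_policy"],
    rationale="Order fell outside the return window stated in the retrieved policy.",
)
print(json.dumps(asdict(record), indent=2))
print(validate(record))
```

Rejecting any final action whose record fails validation turns "why" from something you reconstruct three weeks later into something written down at decision time.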

Your Eval Set Only Has Problems You Already Solved

· 9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.

From a Bug to a Behavior Rate: The AI Postmortem Without a Reproducer

· 10 min read
Tian Pan
Software Engineer

A user files a ticket. The agent told a paying customer their refund would be processed in seven hours when the documented SLA is seven days. Screenshot attached. You pull the trace, find the exact prompt, the exact tool calls, the exact model and seed. You replay it. The model says seven days. You replay it again. Seven days. You replay it a hundred times. It says seven days ninety-eight times and "by end of day" twice, and never once says seven hours. The screenshot is unambiguous. The replay disagrees. The postmortem due Friday now has a "Root Cause" section and no root cause to put in it.

This is the shape of most AI incidents that reach a postmortem. Not the obvious outages — those have stack traces and 500-rate graphs and recover the way every SRE has been trained to expect. The hard ones are the single bad output that left a victim, erased its own conditions on the way out, and refuses to come back when you summon it. Every postmortem template you have ever used assumes a reproducer. Agents do not give you one.
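What a hundred clean replays can tell you is a bound, not an absence. If you treat each replay as an independent trial (an assumption, since replays share a prompt and a seed policy), zero occurrences in n runs gives an exact one-sided upper bound on the true rate. The function name is hypothetical.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on an event's rate after seeing it 0 times in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


replays = 100
seven_hours_bound = zero_event_upper_bound(replays)
end_of_day_rate = 2 / replays  # observed directly in the replays

print(f"'seven hours' rate is below {seven_hours_bound:.2%} at 95% confidence")
print(f"'by end of day' observed rate: {end_of_day_rate:.2%}")
```

For a hundred replays the bound lands just under 3%. That is the honest entry for the "Root Cause" section: a behavior with a rate you have bounded, not a bug you have ruled out.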

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
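The compounding is easy to put on a chart. Under the simplifying assumption that each turn succeeds independently with the same probability, per-session success is the per-turn rate raised to the number of turns. The 0.98 figure below is a made-up per-turn accuracy, and real sessions will usually do worse than this model because errors are correlated.

```python
def session_success(per_turn: float, turns: int) -> float:
    """Probability that every turn succeeds, assuming independent turns."""
    return per_turn ** turns


PER_TURN = 0.98  # hypothetical per-turn accuracy, not a measured value

for turns in (5, 28, 31):
    print(f"{turns:>2} turns: {session_success(PER_TURN, turns):.1%}")
```

A per-turn number that looks excellent on a slide drops to roughly a coin flip plus change by turn twenty-eight. The five-turn demo never gets far enough down the curve to show it.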

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.
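A crude first pass at measuring filler is to flag tool calls whose results never show up in anything the agent produced afterward. The heuristic below uses plain substring matching, which will miss paraphrased uses and flag some legitimate checks, so read its output as a list of candidates to review. The trace shape and example values are invented for illustration.

```python
def unreferenced_calls(tool_calls, later_text):
    """Return names of calls whose result values never appear in later_text.

    tool_calls: list of (tool_name, result_dict) in call order.
    later_text: everything the agent wrote or sent after the calls.
    """
    unused = []
    for name, result in tool_calls:
        values = [str(v) for v in result.values()]
        if not any(v and v in later_text for v in values):
            unused.append(name)
    return unused


trace = [
    ("get_user_profile", {"name": "Dana"}),
    ("check_status", {"status": "green"}),
    ("list_recent_orders", {"latest_order": "A-1042"}),
]
final_answer = "Order A-1042 shipped yesterday."

print(unreferenced_calls(trace, final_answer))
```

Aggregated over a week of traces, the per-tool unreferenced rate tells you which calls are candidates for removal from the default plan.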

The Bug You Can't Reproduce Because the Model Picked a Different Token

· 10 min read
Tian Pan
Software Engineer

A user files a bug. The summary your agent generated dropped a critical paragraph, or the JSON came back malformed, or the answer was confidently wrong. You open the ticket, copy the request, and replay it. It works. You replay it again. Still works. You mark the ticket "cannot reproduce" and move on.

The bug is still there. It is still happening to real users. You just closed it because your debugging toolchain assumes that a fixed input produces a fixed output — and the component you are debugging samples from a probability distribution.
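A replay harness built for a sampled component runs the request many times and reports a distribution. The sketch below uses a toy stand-in for the model that misbehaves about 2% of the time; the failure rate, the function names, and the output labels are all assumptions for illustration.

```python
import random
from collections import Counter


def replay(generate, request, runs, seed=0):
    """Call generate(request, rng) `runs` times and count each distinct outcome."""
    rng = random.Random(seed)
    return Counter(generate(request, rng) for _ in range(runs))


def toy_model(request, rng):
    """Stand-in for a sampled model: malformed output about 2% of the time."""
    return "malformed" if rng.random() < 0.02 else "ok"


RUNS = 5000
counts = replay(toy_model, "summarize ticket", runs=RUNS)
print(f"malformed rate over {RUNS} replays: {counts['malformed'] / RUNS:.2%}")
```

Two replays of a 2% failure come back clean about 96% of the time, which is why "cannot reproduce" is the expected result of the old workflow rather than evidence the bug is gone.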