
299 posts tagged with "observability"


The Finish Reason Your Code Never Inspects

· 10 min read
Tian Pan
Software Engineer

Your handler did everything right. The HTTP status was 200. The body parsed. The text field had characters in it. You incremented responses_succeeded, appended the message to the conversation, returned the JSON down to the client, and moved on. The user got a sentence that ended mid-clause, a redacted answer dressed up as a normal one, or a polite refusal phrased as a completion. Your dashboard does not know any of that happened. The provider told you. You did not read the field.

Every major inference API returns a stop signal alongside the text: OpenAI calls it finish_reason, Anthropic calls it stop_reason, Gemini calls it finishReason. The field is small. It is one enum value per response. It is also the only out-of-band channel the model has for telling you whether the response you just shipped is the answer or a fragment of one. Treating it as cosmetic is the same shape of bug as ignoring HTTP status codes — except your monitoring caught the HTTP one a decade ago and has no opinion about this one.
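One way to stop treating the field as cosmetic is to map every provider's stop signal onto a single small outcome enum before a response is counted as a success. The sketch below is illustrative only: `Outcome` and `classify_stop` are hypothetical names, and the value table reflects the enum values the three providers document at the time of writing, so verify it against the current API references before relying on it.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    COMPLETE = "complete"    # the model finished on its own
    TRUNCATED = "truncated"  # output hit a token limit
    FILTERED = "filtered"    # safety filter, recitation block, or refusal
    TOOL_CALL = "tool_call"  # the model wants a tool run, not a final answer
    UNKNOWN = "unknown"      # missing or unrecognized value


# Keys are lowercased so Gemini's upper-case values share entries with the others.
# Assumption: this table is not exhaustive and may lag behind provider changes.
_STOP_SIGNALS = {
    "stop": Outcome.COMPLETE,           # OpenAI "stop", Gemini "STOP"
    "end_turn": Outcome.COMPLETE,       # Anthropic
    "stop_sequence": Outcome.COMPLETE,  # Anthropic
    "length": Outcome.TRUNCATED,        # OpenAI
    "max_tokens": Outcome.TRUNCATED,    # Anthropic, Gemini "MAX_TOKENS"
    "content_filter": Outcome.FILTERED, # OpenAI
    "refusal": Outcome.FILTERED,        # Anthropic
    "safety": Outcome.FILTERED,         # Gemini "SAFETY"
    "recitation": Outcome.FILTERED,     # Gemini "RECITATION"
    "tool_calls": Outcome.TOOL_CALL,    # OpenAI
    "tool_use": Outcome.TOOL_CALL,      # Anthropic
}


def classify_stop(signal: Optional[str]) -> Outcome:
    """Normalize a provider stop signal; anything unrecognized is UNKNOWN."""
    if signal is None:
        return Outcome.UNKNOWN
    return _STOP_SIGNALS.get(signal.lower(), Outcome.UNKNOWN)
```

A handler could then increment responses_succeeded only when the result is `Outcome.COMPLETE` and emit a separate counter for every other outcome, so that truncations and filtered answers show up on the dashboard instead of hiding inside the success count.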

The Latency Budget Your Agent Loop Stole from the Search Box

· 12 min read
Tian Pan
Software Engineer

The launch metrics looked clean. Answer quality up, citation rate up, the eval suite green. The team that replaced the old keyword search with an agent-backed retriever shipped, took the win, and moved on. Six weeks later somebody noticed the weekly active number on that surface had drifted down twelve percent and nobody could find the regression. There was no regression. The agent worked. The users left because the box that used to answer in two hundred milliseconds now took four seconds, and nothing in the launch retro had a budget for that.

This is the latency-budget transfer problem, and almost nobody draws the org chart that catches it. A search box is not just a function call. It is a thirty-year contract with the user's nervous system: type, see results, scan, click. The 200-millisecond response is not a performance metric on a dashboard somewhere — it is the reason the user's attention is still on the screen when the results arrive. When the team underneath the box replaces a keyword index with an agent loop, the function-call surface looks identical and the SLA on the new call lives in a completely different regime. The latency budget moved from the team that owned the index to the team that owns the agent, and from the team that owns the agent to the user, and the only one who showed up to the meeting was the user.

The Success Metric That Improved Because the Model Declined the Hard Cases

· 9 min read
Tian Pan
Software Engineer

You bumped the model on Tuesday. By Friday, the "task completion rate" dashboard had climbed from 71% to 78%. Leadership noticed. Someone screenshotted it for the all-hands. Two weeks later, support quietly flagged that churn on a specific cohort of complex tickets had doubled. Nobody connected the two events because, on paper, the agent got better. In reality, the new model just got better at refusing.

This is the metric decoupling problem, and it is one of the most expensive ways an LLM-powered product can deceive its own builders. Your success rate did not measure what you thought it measured. It measured the intersection of what the model attempted and what the model got right when it attempted. When a model upgrade, a prompt change, or a safety-tuning pass shifts the boundary of "attempted," your numerator and your denominator move together — and the ratio can go up even as user-perceived quality falls off a cliff.
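The arithmetic is easy to reproduce. The sketch below uses made-up ticket counts, chosen only so the attempted-only rate matches the 71% and 78% figures above, and it assumes the dashboard divides by attempted tasks rather than by all tasks. `completion_rates` is a hypothetical helper, not anything from a real metrics library.

```python
def completion_rates(total: int, attempted: int, correct: int) -> tuple:
    """Return (correct / attempted, correct / total)."""
    return correct / attempted, correct / total


# Hypothetical week before the model bump: every ticket is attempted.
before = completion_rates(total=1000, attempted=1000, correct=710)

# Hypothetical week after: the model declines 150 hard tickets.
after = completion_rates(total=1000, attempted=850, correct=663)

print(f"dashboard rate:  {before[0]:.1%} -> {after[0]:.1%}")
print(f"rate over all:   {before[1]:.1%} -> {after[1]:.1%}")
```

With these numbers the dashboard rate rises while the share of all tickets that were actually resolved falls. Reporting both ratios side by side, along with the attempt rate itself, makes a shift in the "attempted" boundary visible.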

The Synthetic Eval Your Real Users Never Resemble

· 10 min read
Tian Pan
Software Engineer

There is a class of eval failure that no dashboard catches because it shows up as success. The score climbs week over week. The judge agrees with the answer. The regression tests stay green. Meanwhile, the support team is logging a slow drift in user-reported quality, sales is hearing "it doesn't quite get what I meant," and nobody in engineering can reproduce the complaint because every example anyone tries on the eval set passes. The eval and the users live in different distributions, and the eval is the more polished of the two.

The mechanism is simple, and it hides in plain sight: the model that wrote your eval prompts and the model under test are siblings, and siblings share priors. They smooth the same edges, prefer the same phrasings, leave out the same kinds of malformed input. The eval certifies behavior on a world the generator imagined users have. Your actual users live somewhere else.

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.
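Because the monthly bucket is not exposed in real time, the consumer has to keep its own ledger. A minimal sketch, assuming a linear burn target across the month and a client that can count its own tokens per call: `MonthlyBudget`, its method names, and the 1.1 threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonthlyBudget:
    cap_tokens: int        # the contractual monthly cap
    used_tokens: int = 0   # locally counted usage, not the provider's lagged number

    def record(self, tokens: int) -> None:
        """Add the token count of one completed call to the ledger."""
        self.used_tokens += tokens

    def pace(self, day_of_month: int, days_in_month: int) -> float:
        """Actual burn divided by a linear burn; above 1.0 means ahead of budget."""
        expected = self.cap_tokens * day_of_month / days_in_month
        return self.used_tokens / expected if expected else float("inf")

    def should_throttle(self, day_of_month: int, days_in_month: int,
                        threshold: float = 1.1) -> bool:
        """True when burn exceeds the linear target by more than the threshold."""
        return self.pace(day_of_month, days_in_month) > threshold


budget = MonthlyBudget(cap_tokens=3_000_000)
budget.record(1_500_000)
print(budget.pace(day_of_month=10, days_in_month=30))             # 1.5
print(budget.should_throttle(day_of_month=10, days_in_month=30))  # True
```

A linear target is the simplest possible plan; a fleet with weekday-heavy traffic would want a shaped curve instead. The point is only that the slow-down signal comes from the consumer's own ledger well before the 429 does.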

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The renamed status string changed from in_stock to available, and the agent — being a flexible reasoner over text rather than a schema-strict client — interpreted the absence of the old token as "not available," then phrased that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.
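One way to make an agent break loudly is to put a schema-strict check between the tool and the model, so an unknown enum value raises instead of being reinterpreted. The sketch below is an assumption-laden illustration: `ToolContractError`, `validate_search_result`, and the `out_of_stock` value are hypothetical, and only `in_stock` and `available` come from the scenario above.

```python
# Status values the agent's prompt and downstream logic were written against.
# "out_of_stock" is a hypothetical second value added for illustration.
KNOWN_STATUS = {"in_stock", "out_of_stock"}


class ToolContractError(ValueError):
    """Raised when a tool response drifts from the contract the agent expects."""


def validate_search_result(item: dict) -> dict:
    """Return the item unchanged, or raise if its status value is unrecognized."""
    status = item.get("status")
    if status not in KNOWN_STATUS:
        raise ToolContractError(f"unexpected status value: {status!r}")
    return item


validate_search_result({"sku": "A1", "status": "in_stock"})  # passes

try:
    validate_search_result({"sku": "A1", "status": "available"})
except ToolContractError as exc:
    print(f"contract drift detected: {exc}")
```

The check costs the agent some flexibility, which is the intent: the renamed value becomes an alert within the hour instead of a six-week drift in reply content.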

The Trace Replay Your New Model Cannot Trust

· 12 min read
Tian Pan
Software Engineer

The standard playbook for an LLM upgrade has the comforting shape of a unit test. Capture last week's production traces against the incumbent model. Replay them against the candidate. Diff the outputs. If the disagreement rate is below some threshold — say 3% — ship it. The diff is small, the dashboard is green, the migration looks safe. A week later, the on-call channel fills with reports that the new model is forgetting context across turns, calling tools with arguments that no longer parse, and confidently citing documents that have been deleted from the corpus.

The replay didn't lie, exactly. It measured a real thing. It just measured behavior in a context the production model never actually saw, and the green number is a confidence interval over a distribution that doesn't exist anywhere except in the replay harness.

The Trace That Stops at the Provider Boundary

· 11 min read
Tian Pan
Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive minute of your pipeline — measured in dollars, in p95, in user-visible lag — lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.
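Some of that rectangle can be recovered from the client side when the response is streamed. The sketch below wraps any iterator of chunks and reports time to first chunk and time spent after it. `timed_stream` and `on_done` are hypothetical names, and the split it produces is coarse: it cannot separate network, queue wait, and prefill from one another, since all three land before the first chunk.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def timed_stream(chunks: Iterable[str],
                 on_done: Callable[[dict], None],
                 clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield chunks unchanged; report timing once the stream is exhausted."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        if first is None:
            first = clock()  # the moment the first chunk arrived
        count += 1
        yield chunk
    end = clock()
    on_done({
        "ttft_s": None if first is None else first - start,
        "decode_s": None if first is None else end - first,
        "chunks": count,
    })


# Usage with a stand-in stream; a real caller would pass the SDK's stream iterator.
text = "".join(timed_stream(["Hel", "lo"], on_done=print))
```

The two numbers can be attached as attributes on the existing llm.call span. Note that `on_done` only fires if the consumer reads the stream to the end, so abandoned streams need separate handling.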

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.
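A small piece of that contract can be written down by handing the model the clocks it cannot observe. The sketch below builds a prompt preamble from the scheduled time and the current wall clock; `time_preamble` and its wording are hypothetical, and it assumes the caller passes timezone-aware datetimes.

```python
from datetime import datetime, timezone


def time_preamble(scheduled_for: datetime, now: datetime) -> str:
    """Render the scheduler's clock and the wall clock as explicit prompt text."""
    if scheduled_for.tzinfo is None or now.tzinfo is None:
        raise ValueError("pass timezone-aware datetimes")
    scheduled_utc = scheduled_for.astimezone(timezone.utc)
    now_utc = now.astimezone(timezone.utc)
    return (
        f"Scheduled run (UTC): {scheduled_utc.isoformat()}\n"
        f"Current wall clock (UTC): {now_utc.isoformat()}\n"
        f"Treat {now_utc.date().isoformat()} as today's date."
    )


print(time_preamble(
    scheduled_for=datetime(2025, 3, 4, 9, 0, tzinfo=timezone.utc),
    now=datetime.now(timezone.utc),
))
```

This only addresses the model's belief about the date. Which clock the tool layer should stamp on the outgoing message is a separate decision that still has to be made explicitly.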

The Agent That Burned an Hour Re-Trying the Same 400 Error

· 10 min read
Tian Pan
Software Engineer

An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.

The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.
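One place for the correction to land is a guard in the harness that refuses to resend a payload the tool has already rejected as a client error. The sketch below is a minimal illustration under stated assumptions: `RetryGuard`, `fingerprint`, and the set of status codes treated as transient are hypothetical choices, and payloads are assumed to be JSON-serializable.

```python
import hashlib
import json

# Status codes treated as transient here; everything else in 4xx is a client error.
TRANSIENT = {408, 429, 500, 502, 503, 504}


def fingerprint(payload: dict) -> str:
    """Stable hash of a payload, independent of key order."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RetryGuard:
    """Remembers payloads rejected with a non-transient 4xx and blocks resends."""

    def __init__(self) -> None:
        self._rejected = set()

    def record_failure(self, status: int, payload: dict) -> None:
        if 400 <= status < 500 and status not in TRANSIENT:
            self._rejected.add(fingerprint(payload))

    def allow(self, payload: dict) -> bool:
        return fingerprint(payload) not in self._rejected


guard = RetryGuard()
bad = {"name": "Ada"}
guard.record_failure(400, bad)
print(guard.allow(bad))                                        # False
print(guard.allow({"name": "Ada", "email": "ada@example.com"}))  # True
```

When `allow` returns False, the harness can hand the tool's error body back to the model as the next observation instead of firing the request, so the loop only proceeds once the model actually changes the payload.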

The ChatOps Bot That Mistook Silence for Consent

· 10 min read
Tian Pan
Software Engineer

Your deploy bot has been live for nine months. The dashboard says message volume is up and to the right. The thumbs-down rate is stable below two percent. The team that ships it interprets this as adoption. Then a staff engineer mentions, almost in passing, that everyone on his squad muted the channel back in February — they trust the bot's hourly digest about as much as they trust a vendor newsletter, and they got tired of the buzz. The bot is talking to an empty room and the metric calls that traction.

This is the failure mode most chatops teams hit and almost none of them measure. When a bot in Slack or Teams stops getting replies, the easy read is "the agent has reached a steady state — users don't need to argue with it anymore." The honest read is usually the opposite: users are routing around it, muting it, or learning that ignoring the prompt is cheaper than reading it. The engagement chart can't tell you which. The instrumentation has to be redesigned around the assumption that silence is the default and that interpreting it correctly is the whole job.

The Evidence Locker Your Agent Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Your traces log every token. They log every tool call, every retry, every retrieval latency, every model id. They look exhaustive. Then a regulator, a customer, or your own incident channel asks the one question that should be easy: what did the model actually see at the moment it decided? And you discover that your trace recorded the questions but not the material the model was looking at when it answered.

The retrieved chunks have rotated out of the vector store because the corpus was reindexed last Tuesday. The tool response was a streamed payload, and you stored only its final-state summary because storing the full stream tripled your bill. The system prompt was assembled at runtime from a feature flag that has since flipped twice, and your flag service does not retain historical values by timestamp. You have full observability over what happened: the call graph, the token counts, the latencies. You have nothing about what the model was answering against. That gap is the difference between a trace and a decision record, and most teams have not noticed they only built one of the two.
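A decision record, in its simplest form, is a snapshot of the model's inputs taken at call time and stored by value instead of by reference. The sketch below is one possible shape, not a standard: `build_decision_record` and its field names are hypothetical, and it assumes the inputs are small enough to store in full.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional, Sequence


def build_decision_record(trace_id: str,
                          system_prompt: str,
                          retrieved_chunks: Sequence[str],
                          tool_responses: Sequence[str],
                          now: Optional[datetime] = None) -> dict:
    """Capture exactly what the model was given, plus a hash for integrity checks."""
    now = now or datetime.now(timezone.utc)
    inputs = {
        "system_prompt": system_prompt,              # the assembled text, not the flag name
        "retrieved_chunks": list(retrieved_chunks),  # chunk text, not vector-store ids
        "tool_responses": list(tool_responses),      # full payloads, not summaries
    }
    canonical = json.dumps(inputs, sort_keys=True, ensure_ascii=False)
    return {
        "trace_id": trace_id,
        "recorded_at": now.isoformat(),
        "inputs": inputs,
        "inputs_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


record = build_decision_record(
    trace_id="trace-123",
    system_prompt="You are a support assistant.",
    retrieved_chunks=["Refunds are processed within 5 days."],
    tool_responses=['{"order": "shipped"}'],
)
print(record["inputs_sha256"])
```

Storing by value is what survives a reindex or a flag flip. Where full payloads are too expensive to keep, the hash alone at least proves whether a later reconstruction matches what the model saw.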