299 posts tagged with "observability"

Found Capabilities: When Users Ship Features Your Team Never Roadmapped

· 10 min read
Tian Pan
Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.
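A minimal sketch of the missing measurement, assuming you can record an arrival timestamp (in milliseconds) for each streamed token. The function names and the one-second stall threshold are illustrative choices, not values from the post.

```python
def inter_token_gaps(arrival_ms):
    """Gaps in milliseconds between consecutive token arrivals."""
    return [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]


def stall_summary(arrival_ms, stall_threshold_ms=1000):
    """Summarize one stream: total duration, worst gap, and number of stalls.

    stall_threshold_ms is an assumed cutoff for a pause a user would notice.
    """
    gaps = inter_token_gaps(arrival_ms)
    if not gaps:
        return {"total_ms": 0, "max_gap_ms": 0, "stalls": 0}
    return {
        "total_ms": arrival_ms[-1] - arrival_ms[0],
        "max_gap_ms": max(gaps),
        "stalls": sum(1 for gap in gaps if gap >= stall_threshold_ms),
    }


# A stream that finishes well inside a four-second budget
# but pauses for 2.5 seconds in the middle.
timestamps_ms = [0, 50, 100, 150, 2650, 2700, 2750]
summary = stall_summary(timestamps_ms)
```

Here `summary["total_ms"]` looks healthy while `summary["max_gap_ms"]` carries the complaint. Aggregating the per-stream maximum gap across requests is one way to get the distribution a total-time percentile hides.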

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.
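One way to give those six variables a row is a small record attached to every incident. The sketch below is an assumption about shape only: the class name, field names, and types are hypothetical, derived from the list above rather than from any published template.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class LLMIncidentContext:
    """The six variables a classic postmortem template has no row for."""
    prompt_revision: Optional[str] = None        # which prompt version was live
    model_slice: Optional[str] = None            # which model the router picked
    judge_config: Optional[str] = None           # config of the eval judge
    retrieval_index_state: Optional[str] = None  # index snapshot being served
    tool_schema_versions: Dict[str, str] = field(default_factory=dict)
    traffic_mix: Dict[str, float] = field(default_factory=dict)


def missing_fields(ctx):
    """Names of fields nobody filled in, so the gap is visible in review."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name)]


ctx = LLMIncidentContext(prompt_revision="v47")
unanswered = missing_fields(ctx)
```

Listing the unanswered fields explicitly turns "we don't know what changed" into a concrete set of questions someone can be assigned.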

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 12 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
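As a rough illustration of what the tool-layer half could look like, here is a sketch that combines a shared retry budget with capped, jittered backoff that honors a server-supplied wait hint. This is a common pattern, not necessarily the design the post goes on to describe; `RetryBudget`, `backoff_delay`, and every default value are hypothetical.

```python
import random


class RetryBudget:
    """Caps retries at a fraction of first attempts (assumed policy)."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.attempts = 0
        self.retries = 0

    def record_attempt(self):
        self.attempts += 1

    def try_spend(self):
        """Return True if one more retry is allowed, and count it."""
        allowed = self.min_retries + self.ratio * self.attempts
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, retry_after_s=None, base_s=0.25, cap_s=30.0,
                  rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after_s is not None:
        # Treat the server's hint as back-pressure, not a suggestion.
        return max(retry_after_s, 0.0)
    # Full jitter: a random delay up to a capped exponential bound.
    return rng() * min(cap_s, base_s * 2 ** attempt)
```

The budget is shared across calls so that a fleet-wide failure exhausts it quickly, which is the opposite of each agent independently working through seven retries.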

Prompt-Version Skew Across Regions: The Unintended A/B Test Your CDN Ran for Six Hours

· 10 min read
Tian Pan
Software Engineer

You shipped a system-prompt change at 09:14. The rollout dashboard turned green at 09:31. By 11:00 your eval tracker still looked clean, the cost dashboard was unremarkable, and a customer-success engineer pinged the team: structured-output errors on the parser side were up about three percent in Asia-Pacific only. Nothing in North America. Nothing in Europe.

The rollout had paused itself at 67% region coverage because a non-load-bearing health check on one POP flapped during the cutover, and nobody had noticed. For six hours, us-east and eu-west were running prompt v47 while ap-south and ap-northeast were still on v46. You were running a live A/B test split by geography — except you didn't design the test, you couldn't see the test, and the eval suite that was supposed to catch quality regressions was hitting the new version in one region and shrugging.

This failure mode is not a bug in any single tool. It is the predictable consequence of pushing prompts through deployment systems built for a different kind of artifact.
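Detecting the skew itself is cheap once each region reports the prompt version it is serving. A minimal sketch, assuming a hypothetical mapping from region name to version string collected from your serving layer:

```python
from collections import defaultdict


def version_skew(serving):
    """Group regions by the prompt version each one is currently serving."""
    by_version = defaultdict(list)
    for region, version in serving.items():
        by_version[version].append(region)
    return dict(by_version)


# The state described above, six hours into the paused rollout.
serving = {
    "us-east": "v47",
    "eu-west": "v47",
    "ap-south": "v46",
    "ap-northeast": "v46",
}
skew = version_skew(serving)
is_skewed = len(skew) > 1  # more than one version live at once
```

Alerting when `is_skewed` stays true past the expected rollout window would have surfaced the split without waiting for a parser error rate to move.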

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.
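The simplest version of that coverage check is a set difference between what is registered and what has ever been invoked. The sketch below assumes each trace is available as a list of tool names; the tool names other than `revoke_api_key` are made up for illustration.

```python
def unexercised_tools(registered, eval_traces, prod_traces):
    """Tools in the catalog that no eval or production trace has ever called."""
    seen = set()
    for trace in list(eval_traces) + list(prod_traces):
        seen.update(trace)
    return sorted(set(registered) - seen)


registered = ["search_docs", "create_ticket", "revoke_api_key"]
eval_traces = [["search_docs"], ["search_docs", "create_ticket"]]
prod_traces = [["create_ticket"]]

never_called = unexercised_tools(registered, eval_traces, prod_traces)
```

This covers only single tools, not the compositions a planner can build from them, so it is a floor on the untested surface rather than a full reachability analysis.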

Reasoning-Effort Budgeting: When Thinking Tokens Become a Finance Line Item

· 11 min read
Tian Pan
Software Engineer

The first time your finance team asks why a single user racked up a fifty-cent answer to a one-tenth-of-a-cent question, the call will not be about the model. It will be about the line on the invoice that did not exist twelve months ago: reasoning tokens. They look like output tokens on the bill, they bill at output-token rates on most providers, and they have no natural ceiling. A query that would have produced a four-hundred-token reply on a non-reasoning model can quietly burn eight thousand internal thinking tokens to get there — and the only person who notices is the one reconciling the spend.

For most of the API era, "tokens used" was an honest number. You sent a prompt in, you got a response out, and the bill was a clean function of both. Reasoning models broke that intuition. The model now generates a billable chain of thought that the caller never sees before it emits the answer the caller will read, and the size of that chain depends on the model's own assessment of how hard the question was. The user-visible output may be a single sentence. The bill may be for ten pages.
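The arithmetic is worth writing down once. The sketch below uses the token counts from the example above and assumes reasoning tokens bill at the output rate; the price of 10 dollars per million output tokens is a placeholder, not any provider's actual rate.

```python
def output_cost_usd(visible_tokens, reasoning_tokens, usd_per_million_output):
    """Cost of one response when reasoning tokens bill at the output rate."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


PRICE = 10.0  # hypothetical USD per million output tokens

plain = output_cost_usd(400, 0, PRICE)         # non-reasoning reply
reasoned = output_cost_usd(400, 8_000, PRICE)  # same reply after thinking

# Ratio of billed tokens; independent of the price chosen above.
multiplier = (400 + 8_000) / 400
```

The multiplier does not depend on the price, which is why the surprise shows up at every pricing tier: the visible reply is unchanged and the billed token count is 21 times larger.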

The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.
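A quick way to see the gap is to weight eval coverage by traffic instead of counting models. A minimal sketch, assuming you can export per-model request counts from the router; the model labels are hypothetical.

```python
def eval_coverage_by_traffic(request_counts, models_with_evals):
    """Fraction of requests served by a model that has an eval suite."""
    total = sum(request_counts.values())
    if total == 0:
        return 0.0
    covered = sum(
        count for model, count in request_counts.items()
        if model in models_with_evals
    )
    return covered / total


# The split from the story above: 71% of queries go to the cheap model.
request_counts = {"flagship": 29, "cheap": 71}
coverage = eval_coverage_by_traffic(request_counts, {"flagship"})
```

By model count, half the system is evaluated. By traffic, `coverage` says the eval suite sees 29% of what users actually receive.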

Your Shadow Eval Set Is a Compliance Time-Bomb

· 10 min read
Tian Pan
Software Engineer

The most dangerous data store in your AI stack is the one nobody designed. It started with a Slack message during a sprint: "Real users are the only thing that catches real bugs — let's tap a percentage of production traffic into the eval pipeline so we can replay it nightly." Six engineers thumbs-upped the message. Nine months later, the bucket holds 4.3 million traces, an eval job pages the on-call when failure rates rise, and the failure cases are posted verbatim to a Slack channel where forty people can read them. The traces include email addresses, internal company names, partial credit-card digits, employee phone numbers, and customer support transcripts where users explained why they were upset.

Nobody mapped the data flow. No DPIA covered it. The privacy review last quarter looked at the model vendor's API; it didn't look at your eval job. And then a data-subject deletion request arrives, and the team discovers that "delete this user's data everywhere" is a sentence that no longer maps to anything they can actually do.

The AI Feature Metric Trap: Why DAU and Retention Lie About Stochastic Surfaces

· 11 min read
Tian Pan
Software Engineer

A PM walks into the AI feature review with a slide that reads "+12% engagement, +8% session length, retention up 3 points." The room nods. Two desks over, the support lead is staring at a different chart: tickets touching the AI surface are up 22%, and the most common resolution code is "user gave up, agent helped manually." Both numbers are real. Both come from the same product. The PM's dashboard is built on the assumption that the AI feature emits the same shape of event as the button it replaced. It doesn't. And the gap between what the dashboard counts and what the user experienced is where AI features quietly fail in plain sight.

The deterministic-feature playbook treats interaction as a click stream: user fires an event, the system reacts, the user moves on. AI features have a different event shape — a task arc with phases, retries, side trips to a human, and an offline judgment the telemetry never sees. Importing the deterministic dashboard onto that arc is the analytics equivalent of running 2018's interview loop against 2026's job. The numbers go up. The thing the numbers were supposed to predict goes down.

Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

· 12 min read
Tian Pan
Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.
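To make the gap concrete, here is a sketch of a triage label that combines the provider's stop reason with one check the provider cannot do for you: whether the output is structurally complete. The label names are invented for illustration, the stop-reason strings other than `end_turn` are assumed values that vary by provider, and a JSON parse stands in for whatever completeness check fits your output.

```python
import json


def triage_stop(stop_reason, output_text, expects_json=False):
    """Map a provider stop reason plus the output itself to a triage label."""
    if stop_reason == "max_tokens":
        return "truncated_by_budget"
    if stop_reason == "tool_use":
        return "tool_requested"
    if stop_reason == "refusal":
        return "safety_intervened"
    if stop_reason == "end_turn":
        if expects_json:
            try:
                json.loads(output_text)
            except json.JSONDecodeError:
                # The model said it was done; the artifact says otherwise.
                return "ended_incomplete"
        return "ended_complete"
    return "unknown"


label = triage_stop("end_turn", '{"retries": 3, "timeout":', expects_json=True)
```

The half-written config from the incident above lands in `ended_incomplete`, a bucket that a clean `end_turn` alone cannot distinguish from success.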

Trust Ceilings: The Autonomy Variable Your Product Team Can't See

· 10 min read
Tian Pan
Software Engineer

Every agentic feature has a maximum autonomy level above which users start checking work, intervening, or abandoning the feature entirely. That maximum is not a property of your model. It is a property of your users, your domain, and the cost of being wrong, and it does not move because a launch deck says it should. Most teams discover their ceiling the hard way: a feature ships designed for full autonomy, adoption stalls at "agent suggests, human approves," the metrics blame the model, and the next quarter is spent tuning a knob that was never the bottleneck.

The shape of the ceiling is consistent enough across products that it deserves a name. Anthropic's own usage data on Claude Code shows new users using full auto-approve about 20% of the time, climbing past 40% only after roughly 750 sessions. PwC's 2025 survey of 300 senior executives found 79% of companies are using AI agents, but most production deployments operate at "collaborator" or "consultant" levels — the model proposes, the human disposes — not at the fully autonomous tier the marketing implied. The story underneath those numbers is not that users are timid. It is that trust is calibrated to the cost of a recoverable mistake, and your product almost certainly does not let users see, undo, or bound that cost the way they need to.