299 posts tagged with "observability"

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is within SLO. Your product is broken anyway: the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.
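A minimal sketch of the difference between the two metrics, assuming responses are logged as (status code, body) pairs. The refusal phrases, the `is_usable` rule, and the sample data are hypothetical stand-ins; a real check would be specific to the product's own output contract.

```python
import json

# Hypothetical refusal phrases; a real list would come from observed traffic.
REFUSAL_MARKERS = ("i can't help with that", "i'm not able to help")

def is_usable(status_code, body):
    """Usable means: 200 OK, not a refusal, and parseable as JSON."""
    if status_code != 200:
        return False
    lowered = body.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return False
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return False
    return True

def availability_rate(responses):
    """Share of requests that returned 200, which is what the SLA measures."""
    return sum(s == 200 for s, _ in responses) / len(responses) if responses else 0.0

def capability_rate(responses):
    """Share of requests whose answer the product could actually consume."""
    return sum(is_usable(s, b) for s, b in responses) / len(responses) if responses else 0.0

sample = [
    (200, '{"label": "billing"}'),                         # compact JSON, usable
    (200, "I can't help with that."),                      # refusal, still 200 OK
    (200, 'Sure! Here is the JSON: {"label": "billing"}'), # chatty, breaks the parser
]
print(availability_rate(sample), capability_rate(sample))
```

On this sample every request counts toward availability, yet only one in three is usable downstream.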

Escalation Rate Is the Eval Signal Your Offline Tests Missed

· 10 min read
Tian Pan
Software Engineer

Every agent feature has a back door. Some teams call it "escalate to support." Some call it "route to a human reviewer." Some call it the templated "I'm not able to help with that — let me connect you to someone who can." Whatever the label, every production agent has a path that gives up on the user's request and hands it to a human, and the rate at which production traffic takes that path is one of the few signals that doesn't depend on labelers, judges, or a hand-built test set. It is the system telling you, in production, that the model could not handle a request the user actually sent.

That signal is almost always being read by the wrong team. Escalation rate is a workforce-planning metric in most companies: it determines how many human agents the queue needs next quarter, and it lives on a dashboard the operations team reviews on a different cadence than the AI team reads its eval scores. A 30% week-over-week escalation increase shows up as a staffing question in a Monday operations review, while the AI team's eval suite stays green and the leadership readout says the feature is healthy. Both teams are looking at the same production system and arriving at opposite conclusions: ops thinks they need more headcount, AI thinks the model is fine.

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the candidate models have since moved past. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Eval Selection Bias: Why Your Test Set Goes Blind to the Failures That Drove Users Away

· 10 min read
Tian Pan
Software Engineer

There is a quiet failure mode in production-grade LLM evaluation that no leaderboard catches: your test set is built from the users who stayed, so it never asks the questions that made the others leave. Quarter over quarter the eval scores climb, the dashboards turn green, and net retention sags anyway. The team chases "is the eval gameable?" when the real story is simpler and harder. The eval distribution drifted toward survivors, and survivors are exactly the population whose feedback you least need.

This is the WWII bomber armor problem in a new costume. Abraham Wald looked at returning planes, noticed where the bullet holes clustered, and pointed out that the holes you should reinforce against are the ones on planes that didn't come back. Replace bombers with users, replace bullet holes with failed turns, and you have the central pathology of eval sets seeded from production traces.

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

· 11 min read
Tian Pan
Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.
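One way to treat tier mix as a measured quantity is to recompute it from per-request records and compare it with a floor. This is a sketch under assumptions: the tier names, the 90% floor, and the idea that each request logs the tier that served it are all illustrative, and the sample mirrors the 38% and 9% figures above.

```python
from collections import Counter

def tier_mix(tiers):
    """Return the share of requests served by each tier."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}

def mix_violations(mix, min_share):
    """Return tiers whose observed share falls below the floor set for them."""
    return {
        tier: mix.get(tier, 0.0)
        for tier, floor in min_share.items()
        if mix.get(tier, 0.0) < floor
    }

# Hypothetical sample: 53 primary, 38 secondary, 9 canned responses.
served = ["primary"] * 53 + ["secondary"] * 38 + ["canned"] * 9
mix = tier_mix(served)
violations = mix_violations(mix, {"primary": 0.90})  # assumed SLO floor
print(mix, violations)
```

The point of the check is that it alerts on the mix itself, not on any individual demotion.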

The Five Definitions of 'Now' Inside Your LLM Prompt

· 11 min read
Tian Pan
Software Engineer

A customer support agent told a user "based on our latest pricing, as of today" and quoted last quarter's price sheet. The system prompt interpolated "today is {current_date}" correctly. The retrieval layer pulled the document with the highest freshness score. The model answered confidently. Every component did exactly what it was specified to do, and the user got a wrong answer that the on-call engineer could not reproduce because, by the time they replayed the trace at 9pm, "today" was a different day.

This is not a rare bug. It is a failure mode that lives in almost every production LLM pipeline because "now" is implicit in the prompt at five different layers, and those layers were authored at different times, by different people, against different definitions of the present. As long as a request runs synchronously from a foreground user session, the layers mostly agree. The moment the request is replayed for debugging, batch-processed overnight, run from an eval harness pinned in March, or queued and consumed an hour later, the layers start disagreeing — and the model produces an answer that is internally consistent within its prompt but externally wrong.
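A sketch of one mitigation, assuming the request carries a single timestamp captured at acceptance and every layer reads "now" from it instead of from the system clock. `RequestContext`, `render_system_prompt`, `is_fresh`, and the dates are hypothetical names and values for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class RequestContext:
    # Captured once when the request is accepted; replays reuse the stored value.
    as_of: datetime

def render_system_prompt(ctx):
    """Interpolate the date from the request's clock, never from datetime.now()."""
    return f"today is {ctx.as_of.date().isoformat()}"

def is_fresh(doc_effective_from, ctx):
    """A document counts as in effect only relative to the request's own clock."""
    return doc_effective_from <= ctx.as_of

original = RequestContext(as_of=datetime(2025, 3, 14, 16, 30, tzinfo=timezone.utc))
# A replay hours or weeks later rebuilds the context from the stored timestamp.
replayed = RequestContext(as_of=original.as_of)

new_price_sheet = datetime(2025, 4, 1, tzinfo=timezone.utc)
print(render_system_prompt(replayed), is_fresh(new_price_sheet, replayed))
```

With the timestamp stored on the trace, a 9pm replay or an overnight batch renders the same prompt and applies the same freshness cutoff as the original request.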

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.
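If pacing is part of the observable surface, it can at least be measured. The sketch below assumes the client records an arrival timestamp (in seconds) for each token; the function name and the sample timings are illustrative, not taken from any particular SDK.

```python
def stream_texture(arrival_times):
    """Summarize pacing from per-token arrival timestamps in seconds."""
    if len(arrival_times) < 2:
        return {"tokens_per_second": 0.0, "max_gap_s": 0.0}
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    duration = arrival_times[-1] - arrival_times[0]
    return {
        # Intervals per second across the whole stream.
        "tokens_per_second": (len(arrival_times) - 1) / duration if duration > 0 else 0.0,
        # Longest silence, e.g. the pause before a tool call.
        "max_gap_s": max(gaps),
    }

# Hypothetical streams: one with a visible pause, one that never pauses.
paused = stream_texture([0.0, 0.5, 1.0, 2.0])
steady = stream_texture([0.0, 0.25, 0.5, 0.75])
print(paused, steady)
```

Tracking these two numbers across a model upgrade would surface a change in texture even when final answers and schemas are identical.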

Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

· 11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.
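A minimal sketch of searching the cross-product instead of one axis at a time, assuming each axis can be pinned independently to an "old" or "new" state and that some check can score a pinned configuration. The axis names and the `is_regressed` stand-in are hypothetical; in practice that function would run the eval suite.

```python
from itertools import product

def find_failing_cells(axes, is_regressed):
    """Evaluate every combination of axis states and return the ones that regress."""
    names = list(axes)
    failing = []
    for combo in product(*(axes[name] for name in names)):
        config = dict(zip(names, combo))
        if is_regressed(config):
            failing.append(config)
    return failing

axes = {
    "model": ["old", "new"],
    "tools": ["old", "new"],
    "prompt": ["old", "new"],
}

def is_regressed(config):
    # Stand-in for an eval run: the bug fires only when the new model
    # meets the new tool description.
    return config["model"] == "new" and config["tools"] == "new"

failing = find_failing_cells(axes, is_regressed)
print(failing)
```

Rolling back any single axis from the all-new configuration passes here only for model or tools, and each team's isolated rollback against an otherwise changing baseline can still mislead. The full grid costs 2^k eval runs for k two-state axes, so with more axes a pairwise design may be the more practical choice.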

Tool Behavior Drift: The Schema Held, the Semantics Didn't

· 11 min read
Tian Pan
Software Engineer

Your contract tests are green. The schema validator is happy. The tool returns the same shape it did last quarter. And the user-facing answer has been quietly wrong for six weeks.

This is the failure mode that contract testing was never designed to catch. Contract tests verify that the wire format hasn't changed — that search() still returns { results: [{ id, title, score }] }, that create_event still accepts an ISO 8601 string, that the geocoder still emits { lat, lng }. What they don't catch is the moment the search endpoint starts ranking by recency instead of relevance, the calendar API silently snaps your 14:07 start time to 14:00 in the EU region, the geocoder picks a different point inside the same ambiguous polygon, or the LLM-classifier-as-a-tool is upgraded to a new model behind a stable endpoint and the false-positive rate moves four points in a category your eval set never sampled. The schema held. The behavior didn't. Your agent kept reading green checkmarks and produced regressed answers no error log captured.
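One hedged way to watch behavior rather than shape is to pin a few probe queries, store the result IDs the tool returned when it was known to be good, and compare later runs against that baseline. The overlap metric and the threshold below are assumptions chosen for illustration.

```python
def top_k_overlap(baseline_ids, current_ids, k=5):
    """Jaccard overlap between the top-k result IDs of two runs of the same probe."""
    a, b = set(baseline_ids[:k]), set(current_ids[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drifted(baseline_ids, current_ids, k=5, min_overlap=0.6):
    """Flag a probe whose results moved more than the assumed tolerance."""
    return top_k_overlap(baseline_ids, current_ids, k) < min_overlap

# Hypothetical probe: same schema both times, different ranking behavior.
baseline = ["d1", "d2", "d3", "d4"]
current = ["d3", "d4", "d5", "d6"]
print(top_k_overlap(baseline, current, k=4), drifted(baseline, current, k=4))
```

A contract test would pass on both runs, since the response shape is unchanged; the overlap check is what notices the ranking moved.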

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven independent hops, the chance that at least one call lands beyond its p99 is about 7%; measured against each tool's p90, it passes a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving. The problem was that nobody owned the multiplicative composition.
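The composition arithmetic is short enough to write down. This sketch assumes the hops are independent and that "in the tail" means exceeding a given per-tool quantile; both are simplifications, since real tool latencies are often correlated.

```python
def p_any_tail(hops, quantile):
    """Chance that at least one of `hops` independent calls exceeds its own quantile."""
    return 1 - quantile ** hops

for q in (0.99, 0.95, 0.90):
    # Seven hops, matching the agent described above.
    print(q, round(p_any_tail(7, q), 3))
# prints:
# 0.99 0.068
# 0.95 0.302
# 0.9 0.522
```

Which quantile counts as "the tail" changes the number a great deal, so a chain-level latency budget needs to state it explicitly.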

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
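A sketch of separating "accepted" from "completed", assuming each side-effecting tool can be paired with an independent read-back check. `confirm_action`, the fake mail provider, and the delivery set are hypothetical scaffolding, not a real provider API.

```python
def confirm_action(call_tool, verify, *args):
    """Run a tool, then consult an independent observation before claiming completion."""
    response = call_tool(*args)
    accepted = bool(response.get("success"))
    # Only trust completion if a separate check observes the effect.
    completed = accepted and verify(*args)
    return {"accepted": accepted, "completed": completed}

# Fake provider: the message enters the outbound queue but is never delivered.
outbound_queue, delivered = [], set()

def send_email(message_id):
    outbound_queue.append(message_id)
    return {"success": True}  # true to the contract, silent about delivery

def was_delivered(message_id):
    return message_id in delivered

result = confirm_action(send_email, was_delivered, "msg-1")
print(result)
```

Passing both fields to the model lets the summary say "queued" instead of "sent" when the effect has not been observed.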

Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't

· 9 min read
Tian Pan
Software Engineer

A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Twenty-eight seconds later the response lands, two seconds past the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.

The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now reasoning about a budget it cashed out fifteen seconds ago and doesn't know it.
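One way to keep the budget honest is to store an absolute deadline at request acceptance and recompute the remaining time at every planning step. The sketch below uses the thirty-second budget and the twelve- and three-second path costs from the example above; the `Deadline` class, `pick_path`, and the fake clock are illustrative assumptions.

```python
import time

class Deadline:
    """Absolute deadline fixed at request acceptance; remaining time is always recomputed."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

def pick_path(deadline, deep_cost_s=12.0, quick_cost_s=3.0):
    """Choose a plan from the time actually left, not from the configured budget."""
    left = deadline.remaining()
    if left >= deep_cost_s:
        return "deep"
    if left >= quick_cost_s:
        return "quick"
    return "decline"

# Fake clock so the example is deterministic.
now = [0.0]
deadline = Deadline(30.0, clock=lambda: now[0])
first = pick_path(deadline)       # full budget available
now[0] = 20.0                     # a transparent retry consumed twenty seconds
after_retry = pick_path(deadline)
print(first, after_retry)
```

Using a monotonic clock for the real default avoids jumps from wall-clock adjustments, and passing the deadline object through retries means the planner never sees a stale budget.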