
238 posts tagged with "reliability"


Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

· 10 min read
Tian Pan
Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them, the one place where no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full Cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Every one of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.
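To make the seam concrete, here is a minimal sketch of a check that sits between the two tools, plus the arithmetic behind the gap. The field names (date, price_unit) and the check_seam helper are hypothetical; the post does not show either tool's real schema. The last line assumes failures multiply independently, which is a simplification.

```python
from datetime import date

# Hypothetical record shape: neither tool's real schema appears in the post.
def check_seam(listing: dict) -> list[str]:
    """List the ways a search_listings result would break book_appointment."""
    problems = []
    try:
        date.fromisoformat(listing["date"])  # B is assumed to want YYYY-MM-DD
    except (KeyError, TypeError, ValueError):
        problems.append("date is not ISO 8601")
    if listing.get("price_unit") != "dollars":
        problems.append("price is not in dollars")
    return problems

independent = 0.96 * 0.95                # what the per-tool dashboards imply
observed = 0.78                          # what the chained agent delivers
seam_success = observed / independent    # pass rate left over for the seam
```

Under that independence assumption the two tools alone would give about 91%, so the seam is passing only about 86% of hand-offs, and nothing on the dashboard measures it.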

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is within its SLO. Your product is broken anyway: the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.
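One way to see the difference is to score the same responses twice, once for availability and once for usability. The sketch below is illustrative only: is_usable, capability_rate, and the refusal markers are assumptions, not anything a provider ships.

```python
import json

# Assumed refusal phrasing; a real system would need a broader detector.
REFUSAL_MARKERS = ("i can't help with that", "i cannot help with that")

def is_usable(status: int, body: str, required_keys: set) -> bool:
    """Availability stops at status == 200; capability also checks the body."""
    if status != 200:
        return False
    if any(marker in body.lower() for marker in REFUSAL_MARKERS):
        return False
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False  # chatty prose around the JSON breaks the parser
    return isinstance(parsed, dict) and required_keys.issubset(parsed)

def capability_rate(responses, required_keys):
    """Share of (status, body) pairs that a downstream parser could use."""
    usable = sum(is_usable(status, body, required_keys) for status, body in responses)
    return usable / len(responses)
```

Run over a sample of production responses, the two numbers can diverge sharply while every request still counts as a success for SLA purposes.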

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

· 11 min read
Tian Pan
Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, barely half of the experience: by those numbers it serves at most 53% of requests. Nobody noticed because no single alert ever fired; every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.
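Treating the mix as an SLO can be as simple as counting which tier served each request and comparing the shares to agreed bounds. This is a sketch under assumptions: the tier names and the thresholds in TIER_MIX_SLO are invented for illustration.

```python
from collections import Counter

# Hypothetical bounds: (minimum share, maximum share) per tier.
TIER_MIX_SLO = {
    "primary": (0.90, 1.00),
    "secondary": (0.00, 0.08),
    "canned": (0.00, 0.02),
}

def tier_mix(served_by):
    """Share of requests served by each tier, from a per-request tier label."""
    counts = Counter(served_by)
    total = sum(counts.values())
    return {tier: counts.get(tier, 0) / total for tier in TIER_MIX_SLO}

def mix_violations(mix):
    """Tiers whose share falls outside the agreed bounds."""
    return sorted(
        tier for tier, (low, high) in TIER_MIX_SLO.items()
        if not low <= mix[tier] <= high
    )
```

The key design choice is that the label is recorded per served request, not per fallback event, so the meter measures mix rather than failure.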

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker reading "YIELD" over the word in the middle. You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.
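What a "primitive for disagreement" might look like at the application layer, as a rough sketch: keep each channel's reading separate and surface the conflict instead of blending it. ChannelReading and describe are hypothetical names, and how the two readings are obtained (for example a shape classifier and an OCR pass) is assumed, not specified by the post.

```python
from dataclasses import dataclass

@dataclass
class ChannelReading:
    visual: str   # what shape and color cues imply, e.g. "stop"
    textual: str  # what the text in the image reads, e.g. "yield"

    def agreed(self) -> bool:
        return self.visual.strip().lower() == self.textual.strip().lower()

def describe(reading: ChannelReading) -> str:
    """Report a conflict explicitly rather than picking or averaging a channel."""
    if reading.agreed():
        return reading.textual
    return (
        f"conflict: appearance suggests '{reading.visual}', "
        f"text reads '{reading.textual}'"
    )
```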

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

· 11 min read
Tian Pan
Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.
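A minimal sketch of the corresponding check: tag every eval result with the region that served it and compare the per-region averages. The (region, score) tuple shape and both function names are assumptions; actually pinning a request to a region depends on your provider and is not shown here.

```python
from statistics import mean

def scores_by_region(results):
    """Average eval score per region from (region, score) pairs."""
    grouped = {}
    for region, score in results:
        grouped.setdefault(region, []).append(score)
    return {region: mean(scores) for region, scores in grouped.items()}

def regional_gap(results):
    """Spread between the best and worst region; a flat global mean can hide it."""
    by_region = scores_by_region(results)
    return max(by_region.values()) - min(by_region.values())
```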

Tool Behavior Drift: The Schema Held, the Semantics Didn't

· 11 min read
Tian Pan
Software Engineer

Your contract tests are green. The schema validator is happy. The tool returns the same shape it did last quarter. And the user-facing answer has been quietly wrong for six weeks.

This is the failure mode that contract testing was never designed to catch. Contract tests verify that the wire format hasn't changed — that search() still returns { results: [{ id, title, score }] }, that create_event still accepts an ISO 8601 string, that the geocoder still emits { lat, lng }. What they don't catch is the moment the search endpoint starts ranking by recency instead of relevance, the calendar API silently snaps your 14:07 start time to 14:00 in the EU region, the geocoder picks a different point inside the same ambiguous polygon, or the LLM-classifier-as-a-tool is upgraded to a new model behind a stable endpoint and the false-positive rate moves four points in a category your eval set never sampled. The schema held. The behavior didn't. Your agent kept reading green checkmarks and produced regressed answers no error log captured.
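A behavioral canary complements the contract test by pinning what the tool returns, not just its shape. The sketch below compares a live search ranking against a stored golden ranking for the same query; top_k_overlap, drifted, and the 0.6 threshold are illustrative assumptions.

```python
def top_k_overlap(golden_ids, live_ids, k=5):
    """Fraction of the golden top-k results still present in the live top-k."""
    golden, live = set(golden_ids[:k]), set(live_ids[:k])
    return len(golden & live) / max(len(golden), 1)

def drifted(golden_ids, live_ids, k=5, min_overlap=0.6):
    """True when the schema may be intact but the ranking has moved."""
    return top_k_overlap(golden_ids, live_ids, k) < min_overlap
```

A switch from relevance to recency ranking would pass every schema check and still trip this comparison on most golden queries.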

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
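One hedge against structural lies is to classify a tool call by an observed effect instead of its self-reported status. In this sketch, classify and the verify callback are hypothetical; what verify actually queries (a delivery log, a ledger read) depends on the system.

```python
from typing import Callable

def classify(tool_result: dict, verify: Callable[[], bool]) -> str:
    """Separate 'the request was accepted' from 'the operation completed'."""
    if not tool_result.get("success"):
        return "failed"
    # success: true only proves acceptance; verify() checks the real effect.
    return "done" if verify() else "accepted_unverified"

# Stand-in for a mail provider's delivery log (assumption for illustration).
delivered_message_ids = set()

def email_delivered(message_id: str) -> bool:
    return message_id in delivered_message_ids
```

An agent that only sees "accepted_unverified" has grounds to say "I've queued the email" rather than "I've sent it."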

Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't

· 9 min read
Tian Pan
Software Engineer

A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Forty-six seconds later the response lands, sixteen seconds past the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.

The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now reasoning about a budget it cashed out fifteen seconds ago and doesn't know it.
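The usual remedy is to pass an absolute deadline around and recompute the remaining time at every planning step. A minimal sketch, with Deadline, pick_path, and the two-second margin as assumptions; the twelve- and three-second path costs come from the scenario above.

```python
import time

class Deadline:
    """Absolute expiry fixed at request acceptance; remaining() is never cached."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

def pick_path(deadline, deep_s=12.0, quick_s=3.0, margin_s=2.0):
    """Choose a plan from the time actually left, not the original budget."""
    left = deadline.remaining()
    if left >= deep_s + margin_s:
        return "deep"
    if left >= quick_s:
        return "quick"
    return "give_up"
```

Because remaining() reads the clock each time, a transparent retry that burns fifteen seconds shows up in the planner's next decision instead of disappearing into stale metadata.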

Fallback Path Atrophy: Your Graceful Degradation Stopped Working Three Months Ago

· 9 min read
Tian Pan
Software Engineer

The fallback path you wrote nine months ago — the one that catches model timeouts, swaps to a cheaper provider, returns a templated message when both are down — has not actually run in production for the last twelve weeks. It was exercised once during the original launch, the integration tests still pass against it, and the runbook still references it. None of that means it works. A refactor in week six changed the shape of the upstream context object. A library bump in week nine quietly moved a config key. The code still compiles. The tests still pass because they were written against the same stale fixtures the code was. The next time your primary path 504s, your "graceful degradation" will throw a NullPointerException into a user's face, and the postmortem will note — for the third time this year — that the fallback was never re-tested after the upstream contract changed.

This is the quiet failure mode of resilience engineering in AI systems. The fallback path is the part of your application that exists specifically to be ignored. Production traffic flows around it for ninety-nine days out of a hundred. CI never exercises it against the current upstream contract, because the only tests that touch it run on stale fixtures. The team that owns it forgets it exists between incidents. Then on day one hundred, when the primary model provider has a regional outage and you finally need it, the path fails in front of a paying customer.
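A fallback drill keeps the path exercised by injecting the failure on purpose. The sketch below is deliberately tiny: answer, failing, and drill are hypothetical, and a real drill would run against current request objects rather than strings.

```python
class PrimaryDown(Exception):
    pass

def answer(prompt, primary, secondary, template="Service is busy; please retry."):
    """Try each tier in order; fall through to a templated message."""
    for tier in (primary, secondary):
        try:
            return tier(prompt)
        except Exception:  # broad on purpose: any tier failure demotes the request
            continue
    return template

def failing(prompt):
    raise PrimaryDown("injected fault")

def drill():
    """Force each degradation step and confirm the path still returns something."""
    assert answer("hi", failing, lambda p: "secondary:" + p) == "secondary:hi"
    assert answer("hi", failing, failing) == "Service is busy; please retry."
    return True
```

Scheduling a drill like this in CI or as a periodic production canary is what turns "the fallback exists" into "the fallback ran this week."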

Hidden SDK Retries: Why You're Paying Twice and Don't Know It

· 10 min read
Tian Pan
Software Engineer

Open the OpenAI Python SDK source and you will find a quiet line: DEFAULT_MAX_RETRIES = 2. The Anthropic SDK ships the same default. Most TypeScript SDKs match. Two retries, exponential backoff, automatic on connection errors, 408, 409, 429, and any 5xx — fired before your code ever sees the failure. You do not configure this. You do not opt in. You usually do not know it is happening, because the metric your app records is request_count, not attempt_count, and the only span your tracer ever sees is the outer one the SDK closes after the final attempt.

This is fine, mostly, until it is not. Add an application-level retry decorator on top of that SDK call — the kind every team writes after their first 429 — and you have built a 3×3 storm: the SDK tries three times, your wrapper tries three times around the SDK, and a single user request fans out to nine inference calls during a provider degradation. The provider's bill counts every attempt. Your dashboards count one. The reconciliation, when someone finally runs it, is a quarter-end conversation nobody enjoys.
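The multiplication is easy to reproduce with a stand-in for the SDK. In this sketch call_with_retries imitates both retry layers without backoff, and degraded_provider is a fake that always fails; none of it is the SDKs' actual code.

```python
def worst_case_attempts(sdk_max_retries, app_max_retries):
    """Each layer makes 1 + retries attempts, and the layers multiply."""
    return (1 + sdk_max_retries) * (1 + app_max_retries)

def call_with_retries(fn, max_retries):
    """Simplified retry loop (no backoff) standing in for either layer."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries:
                raise

counter = {"attempts": 0}

def degraded_provider():
    counter["attempts"] += 1          # what the provider bills
    raise RuntimeError("503 from provider")

def user_request():
    sdk_call = lambda: call_with_retries(degraded_provider, 2)  # SDK default
    return call_with_retries(sdk_call, 2)                       # app wrapper
```

One user_request() during a degradation drives counter["attempts"] to nine while the application records a single failed request. Both the OpenAI and Anthropic Python clients accept a max_retries setting, so one option is to set it to 0 and let a single application-level layer own the retry policy and the attempt counter.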

AI Feature Dependency Graphs: Resilience Engineering When Your Services Share a Model

· 10 min read
Tian Pan
Software Engineer

Your embedding model goes down at 3 PM on a Tuesday. Within thirty seconds, your support chat stops answering questions, your personalized recommendation engine starts returning empty results, your document search returns nothing, and your onboarding assistant stops working. Your on-call engineer opens the incident channel and sees fifteen simultaneous alerts from features that have no visible relationship to each other. There is no stack trace pointing to the root cause. It looks like a distributed systems outage — but it isn't. It's a single shared dependency failing, and you didn't know fifteen features shared it.

This is the AI feature dependency problem: the infrastructure layer underneath your product features is deeply interconnected, but your architecture diagrams show each feature as an isolated box. When the coupling is invisible, failure propagation is invisible too — until it isn't.
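Making the coupling visible starts with inverting the feature list into a dependency-to-features map. The inventory below is hypothetical (the post names the four features but not their full dependency lists), and blast_radius and shared_dependencies are illustrative helpers.

```python
from collections import defaultdict

# Hypothetical inventory: feature -> model and infrastructure dependencies.
FEATURE_DEPS = {
    "support_chat": ["embedding_model", "chat_model"],
    "recommendations": ["embedding_model"],
    "document_search": ["embedding_model", "vector_store"],
    "onboarding_assistant": ["embedding_model", "chat_model"],
}

def blast_radius(feature_deps):
    """Invert the map: for each dependency, which features fail with it."""
    radius = defaultdict(set)
    for feature, deps in feature_deps.items():
        for dep in deps:
            radius[dep].add(feature)
    return dict(radius)

def shared_dependencies(feature_deps, min_features=2):
    """Dependencies whose failure would take down several features at once."""
    return {
        dep: sorted(features)
        for dep, features in blast_radius(feature_deps).items()
        if len(features) >= min_features
    }
```

With this map on hand, fifteen simultaneous alerts can be grouped by the one dependency they have in common instead of being triaged as fifteen incidents.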

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.
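Before committing to a consistency claim, it helps to measure one. A minimal sketch, assuming you can run the same input several times and reduce each output to a comparable label; agreement_rate is a hypothetical helper, not a standard metric.

```python
from collections import Counter

def agreement_rate(outputs):
    """Share of repeated runs that match the most common output for one input."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)
```

For example, four runs of the same document review that come back approve, approve, reject, approve give an agreement rate of 0.75, which is the number a contract promising identical results would need to account for.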