182 posts tagged with "reliability"

The AI Gateway Is the SPOF Nobody Named

May 14, 2026 · 10 min read

Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down — and in 2025 OpenAI was tracked having 294 outage events since January, with Anthropic logging 184.5 hours of total customer impact in December alone — the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product simultaneously stops, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."

Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

May 14, 2026 · 9 min read

Tian Pan

Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong — it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

May 13, 2026 · 12 min read

Tian Pan

Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.

The Self-Critique Tax: When Asking the Model to Check Its Own Work Costs Double for Modest Wins

May 13, 2026 · 11 min read

Tian Pan

Software Engineer

A team ships a self-critique loop into production because the benchmark numbers looked irresistible: Self-Refine reported a 20 percent absolute improvement averaged across seven tasks, Chain-of-Verification cut hallucinations by 50 to 70 percent on QA workloads, and reflection prompts pushed math-equation accuracy up 34.7 percent in one widely-cited paper. A month later the finance review surfaces the bill. The product's per-request cost has roughly tripled, p99 latency is up by a factor of three, and the actual quality lift that survived contact with production traffic is closer to three percent than thirty. The self-critique loop is doing exactly what it advertised. The team just never priced it.

This is the self-critique tax: a reliability pattern that reads like a free quality win on a slide and reads like a structural cost increase on an invoice. The pattern itself is sound — there are real cases where generate-then-verify is the right answer. The failure mode is shipping it as a default instead of as a calibrated intervention, and discovering at the wrong time of the quarter that "the model checks its own work" was actually a procurement decision.

Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

May 13, 2026 · 10 min read

Tian Pan

Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

May 13, 2026 · 9 min read

Tian Pan

Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is in the SLO. Your product is broken anyway — the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker over the word in the middle that reads "YIELD." You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.

Tool Behavior Drift: The Schema Held, the Semantics Didn't

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

Your contract tests are green. The schema validator is happy. The tool returns the same shape it did last quarter. And the user-facing answer has been quietly wrong for six weeks.

This is the failure mode that contract testing was never designed to catch. Contract tests verify that the wire format hasn't changed — that search() still returns { results: [{ id, title, score }] }, that create_event still accepts an ISO 8601 string, that the geocoder still emits { lat, lng }. What they don't catch is the moment the search endpoint starts ranking by recency instead of relevance, the calendar API silently snaps your 14:07 start time to 14:00 in the EU region, the geocoder picks a different point inside the same ambiguous polygon, or the LLM-classifier-as-a-tool is upgraded to a new model behind a stable endpoint and the false-positive rate moves four points in a category your eval set never sampled. The schema held. The behavior didn't. Your agent kept reading green checkmarks and produced regressed answers no error log captured.

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

May 10, 2026 · 10 min read

Tian Pan

Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.

Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't

May 10, 2026 · 9 min read

Tian Pan

Software Engineer

A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Twenty-eight seconds later the response lands, two seconds past the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.

The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now reasoning about a budget it cashed out fifteen seconds ago and doesn't know it.

About Tian Pan