Skip to main content

2 posts tagged with "resilience"

View all tags

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.

Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough

· 10 min read
Tian Pan
Software Engineer

During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.

The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic patterns that look indistinguishable from a DDoS attack under traditional request-per-minute rate limits. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.