The LLM Request Lifecycle Is a State Machine — Treat It Like One

9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.

Why Implicit States Are a Production Liability

Consider a typical LLM call wrapper in production. It probably has:

  • A retry loop with exponential backoff for transient errors
  • A fallback to a secondary model when the primary returns a 503 or hits a rate limit
  • A validation step that checks whether the output matches the expected schema
  • A re-prompt path that rewrites the request when validation fails

Each of these is a state. Each transition between them is an event. But most implementations encode all of this as nested if-else logic inside a single function, with ad hoc logging that captures some transitions and misses others.

The practical consequences show up in three recurring failure modes:

Silent fallbacks. The primary model fails, the system routes to the secondary, and the user gets a response — but the response has subtly worse quality, higher latency, or higher cost than usual. Because no one logged the fallback transition explicitly, this pattern is invisible in dashboards until someone notices something wrong downstream.

Retry storms in disguise. When a provider experiences degraded performance rather than hard errors, retries succeed on the third or fourth attempt and appear as "slow requests" rather than "failures." The system looks fine from the outside. But aggregate retry costs accumulate, and the 90th percentile latency increases silently over weeks.

Validation failures misclassified as successes. The API returns HTTP 200 with a response that fails schema validation. The wrapper re-prompts once, gets a valid response, and returns it. The caller sees a success. The fact that this request required two LLM calls instead of one, took twice as long, and cost twice as much is unrecorded.

All three of these failures have the same root cause: the intermediate states are invisible.

Mapping the Actual States

A useful model defines eight states for any LLM request:

  1. PENDING — the request is queued but hasn't been dispatched
  2. DISPATCHED — the request has been sent to a specific provider/model
  3. AWAITING_RESPONSE — waiting for the stream to begin or the response to arrive
  4. VALIDATING — the response arrived; output is being checked
  5. RETRYING — the previous attempt failed; backoff timer is running
  6. FALLING_BACK — the primary path is considered unavailable; routing to secondary
  7. CIRCUIT_OPEN — the circuit breaker has tripped; requests fail fast without attempting the provider
  8. TERMINAL — the request reached a final outcome (SUCCESS, VALIDATION_FAILURE, EXHAUSTED, or DEGRADED)
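The eight states can be written down directly. A minimal sketch in Python — the transition table here is an illustrative assumption about which moves are legal, not a definitive specification:

```python
from enum import Enum, auto

class RequestState(Enum):
    PENDING = auto()
    DISPATCHED = auto()
    AWAITING_RESPONSE = auto()
    VALIDATING = auto()
    RETRYING = auto()
    FALLING_BACK = auto()
    CIRCUIT_OPEN = auto()
    TERMINAL = auto()

class TerminalOutcome(Enum):
    SUCCESS = auto()
    VALIDATION_FAILURE = auto()
    EXHAUSTED = auto()
    DEGRADED = auto()

# One plausible transition table. Encoding it explicitly means an
# illegal move raises at the moment it happens, instead of being
# silently absorbed by nested if-else logic.
LEGAL_TRANSITIONS = {
    RequestState.PENDING: {RequestState.DISPATCHED, RequestState.CIRCUIT_OPEN},
    RequestState.DISPATCHED: {RequestState.AWAITING_RESPONSE},
    RequestState.AWAITING_RESPONSE: {RequestState.VALIDATING, RequestState.RETRYING,
                                     RequestState.FALLING_BACK, RequestState.TERMINAL},
    RequestState.VALIDATING: {RequestState.TERMINAL, RequestState.RETRYING,
                              RequestState.FALLING_BACK},
    RequestState.RETRYING: {RequestState.DISPATCHED, RequestState.FALLING_BACK,
                            RequestState.TERMINAL},
    RequestState.FALLING_BACK: {RequestState.DISPATCHED, RequestState.CIRCUIT_OPEN,
                                RequestState.TERMINAL},
    RequestState.CIRCUIT_OPEN: {RequestState.FALLING_BACK, RequestState.TERMINAL},
    RequestState.TERMINAL: set(),
}

def can_transition(src: RequestState, dst: RequestState) -> bool:
    return dst in LEGAL_TRANSITIONS[src]
```

Note the one structural guarantee the table encodes: `TERMINAL` has no outgoing transitions, so every request that reaches it is genuinely done.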

The transitions between these states are events, not log messages. Each transition has a cause (what triggered it), a target state (where we're going), and a cost (latency added, tokens consumed, provider change).

What makes this machine explicit rather than implicit is recording every state entry and exit as a structured event, not as a side effect of log.info scattered through the retry loop.

Separating Three Distinct Failure Categories

One reason implicit implementations get complicated is that engineers conflate three failure types that actually require different handling:

Transient infrastructure failures are short-lived and self-resolving: rate limit 429s, brief 503s, network timeouts, TLS handshakes that time out under load. The correct response is to wait and retry the same provider. The wait time should start at one to two seconds, double with each attempt, and include random jitter to avoid thundering herds. A reasonable cap is three to five retries before escalating.
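A sketch of that retry policy, assuming full jitter (draw uniformly up to the exponential ceiling) and an injectable sleep so the policy is testable; the helper names are hypothetical:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: grow the ceiling as
    base * 2**attempt, cap it, then draw uniformly so concurrent
    clients don't retry in lockstep (the thundering-herd problem)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def with_retries(call, is_transient, max_retries: int = 4, sleep=time.sleep):
    """Retry `call` against the same provider for transient failures only;
    anything else propagates immediately to the fallback/validation logic."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries or not is_transient(exc):
                raise
            sleep(backoff_delay(attempt))
```

The `is_transient` predicate is the important design choice: it keeps the classification decision (is this a 429 or a schema failure?) out of the retry mechanism itself.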

Provider degradation is the failure mode that retries cannot solve and circuit breakers exist to address. It shows up as rising timeout rates before error rates increase — P99 latency spikes while P50 stays normal, or a gradual rise in requests that time out at the 30-second hard limit. Continuing to send requests to a degrading provider wastes budget and worsens latency without improving success rates. The circuit breaker tracks failure rate over a rolling window (a common threshold is 50% failures over 60 seconds) and, when the threshold is exceeded, moves the provider to CIRCUIT_OPEN state. New requests skip the provider entirely until a probe request confirms recovery.
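A rolling-window breaker of this kind can be sketched in a few lines; the thresholds mirror the ones above, and the `min_samples` guard (don't trip on thin evidence) is an assumption of this sketch:

```python
import time
from collections import deque

class CircuitBreaker:
    """Tracks per-provider outcomes over a rolling window and opens
    when the failure rate crosses a threshold (here, 50% over 60s)."""

    def __init__(self, failure_threshold=0.5, window_seconds=60.0,
                 min_samples=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window_seconds
        self.min_samples = min_samples
        self.clock = clock          # injectable for testing
        self.events = deque()       # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def is_open(self) -> bool:
        self._evict()
        if len(self.events) < self.min_samples:
            return False  # not enough evidence to fail fast
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) >= self.failure_threshold
```

A production version would also need the half-open probe state described later — periodically letting one request through to test recovery — which is omitted here for brevity.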

Quality failures are the ones that get missed most often. They occur when the provider returns HTTP 200 but the output fails validation — wrong schema, truncated content, hallucinated values in constrained fields, empty required arrays. Retrying the same prompt against the same model rarely helps here; the failure is deterministic for that prompt. The correct response is to move to a re-prompt variant (a semantically equivalent but differently structured request) or to escalate to a more capable model. Both paths need to be represented as explicit transitions, not as an undocumented second call hidden inside the validation function.
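The three categories can be made explicit with a small per-request classifier. A sketch, with illustrative status codes and field names — note that degradation deliberately does not appear here, because it is an aggregate signal:

```python
from typing import Optional

# Status codes treated as transient; an assumption of this sketch.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def classify_failure(status_code: Optional[int], timed_out: bool,
                     schema_valid: Optional[bool]) -> str:
    """Per-request classification. Provider degradation is absent on
    purpose: it shows up as rising timeout rates across many requests
    and is detected by the circuit breaker, not by any single call."""
    if status_code == 200 and schema_valid is False:
        return "quality"      # deterministic for this prompt: re-prompt variant or escalate
    if timed_out or status_code in TRANSIENT_STATUS:
        return "transient"    # backoff with jitter, retry the same provider
    return "unclassified"     # surface for triage rather than guessing
```

Keeping this decision in one pure function means the "what kind of failure was this?" question has exactly one answer per request, which is what makes the resulting transitions trustworthy.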

Making Every Transition Observable

An explicit state machine produces a trace for every request that answers the following questions without requiring log archaeology:

  • Which provider was attempted first, and in what order did subsequent attempts occur?
  • How many retry attempts were made before success or escalation?
  • At what latency did each attempt time out or complete?
  • Did the request require re-prompting due to validation failure? How many times?
  • Was the circuit breaker open for this provider at the time of routing?
  • What was the terminal outcome and what caused it?

This trace is a structured object, not a series of log lines. It has a request ID, a sequence of state transitions each with a timestamp and a reason code, and a terminal state with outcome metadata (tokens consumed, total latency, providers attempted, fallback depth).
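One plausible shape for that trace object, as a sketch (field names are illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Transition:
    state: str        # e.g. "FALLING_BACK"
    reason: str       # machine-readable cause code, e.g. "http_503"
    timestamp: float

@dataclass
class RequestTrace:
    request_id: str
    transitions: list = field(default_factory=list)
    terminal_state: Optional[str] = None
    outcome: dict = field(default_factory=dict)  # tokens, latency, providers attempted

    def record(self, state: str, reason: str, timestamp: float) -> None:
        self.transitions.append(Transition(state, reason, timestamp))

    def fallback_depth(self) -> int:
        # How far down the fallback chain this request travelled.
        return sum(1 for t in self.transitions if t.state == "FALLING_BACK")
```

Because the trace is data rather than log text, aggregate questions — fallback rate, retry distribution, re-prompt counts per template — become group-bys over a table instead of regex passes over log files.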

The observability value is cumulative. A single trace tells you what happened to one request. Aggregating traces tells you what's happening to the system. Rising fallback rates indicate primary provider degradation before errors surface in dashboards. Rising average fallback depth (how far down the fallback chain requests travel before succeeding) indicates systemic pressure. A distribution of re-prompt counts shows which prompt templates have structural output quality issues.

None of this is visible in a system where retry and fallback are implicit.

The Implementation Principle: Separate Execution from Observation

A practical trap when building this is coupling the state recording logic to the execution logic. If you add trace.add_event("retrying") inside the retry loop, you've made the observability code harder to test, harder to replace, and entangled with the business logic it's meant to observe.

The cleaner approach is to implement the state machine as a separate layer that drives the execution but doesn't perform it. The machine knows which state the request is in and what transitions are legal. The execution layer calls the provider. The machine records the transition and decides the next state based on the outcome.
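A minimal sketch of that separation: a pure decision function that knows only states and outcomes, and a driver that performs the side effects. State names, outcome strings, and the retry budget are all illustrative assumptions:

```python
def next_state(state: str, outcome: str, attempt: int, max_retries: int = 3) -> str:
    """Pure decision function: given the current state and what just
    happened, return the next state. No I/O, so trivially unit-testable."""
    if state == "DISPATCHED":
        return "AWAITING_RESPONSE"
    if state == "AWAITING_RESPONSE":
        if outcome == "response_received":
            return "VALIDATING"
        if outcome == "transient_error" and attempt < max_retries:
            return "RETRYING"
        return "FALLING_BACK"
    if state == "VALIDATING":
        if outcome == "valid":
            return "TERMINAL"
        return "RETRYING" if attempt < max_retries else "TERMINAL"
    if state in ("RETRYING", "FALLING_BACK"):
        return "DISPATCHED"
    raise ValueError(f"unexpected state: {state}")

def run_request(execute, observe, max_retries: int = 3) -> str:
    """Driver: `execute(state)` performs the side effect for that state
    (provider call, backoff sleep, validation) and returns an outcome
    string; `observe` records every transition without ever touching
    execution. A real driver would also budget fallback depth before
    declaring the request EXHAUSTED."""
    state, attempt = "DISPATCHED", 0
    while state != "TERMINAL":
        outcome = execute(state)
        if state == "RETRYING":
            attempt += 1
        new = next_state(state, outcome, attempt, max_retries)
        observe(state, new, outcome)
        state = new
    return state
```

Swapping retry budgets per request type then means passing a different `max_retries` (or policy object) to the same machine — the instrumentation in `observe` never changes.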

This separation also makes the retry policy configurable independently of the observation policy, which matters when you want different retry budgets for different request types (interactive user-facing queries versus background batch jobs) without duplicating the instrumentation.

One implication worth noting from production experience: synchronous logging inside the request path is dangerous. Writing each state transition to a database inline with the request can double p99 latency before anyone notices the cause. Observability events should be emitted asynchronously to a queue or stream and processed off the critical path.
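The async emission pattern can be sketched with an in-memory queue and a background drain thread (a real system would likely hand events to a stream like Kafka instead; `sink` and the class name are hypothetical):

```python
import queue
import threading

class AsyncEventEmitter:
    """Transition events go onto a queue; a background thread drains
    them into `sink`, keeping persistence off the request's critical path."""

    def __init__(self, sink):
        self.q = queue.Queue()
        self.sink = sink  # e.g. writes to a log stream or database
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def emit(self, event: dict) -> None:
        self.q.put(event)  # O(1); never blocks on the sink's latency

    def _drain(self) -> None:
        while True:
            event = self.q.get()
            if event is None:      # shutdown sentinel
                break
            self.sink(event)
            self.q.task_done()

    def close(self) -> None:
        self.q.put(None)
        self.worker.join()
```

The trade-off is the usual one for async observability: events can be lost on a crash, which is acceptable for traces in a way it would not be for the responses themselves.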

What You Actually Gain

Teams that implement this pattern report the same benefits, consistently and in the same order:

First, mean time to debug drops significantly. When a request behaves unexpectedly, the trace shows exactly which states it traversed and why, without guesswork. The question "why was this response late and low quality?" gets answered in seconds rather than requiring a cross-service log join.

Second, cost attribution becomes possible. Because every LLM call is tagged with its cause (first attempt, retry N, fallback at depth M, re-prompt due to validation failure), token spend can be decomposed. Teams typically find that five to fifteen percent of total token costs are retry and re-prompt overhead — overhead that was previously invisible and therefore unoptimizable.

Third, provider reliability becomes measurable per request type. A model that has high error rates on short creative completions but excellent reliability on structured extraction tasks will show this pattern in aggregated traces, enabling routing decisions grounded in empirical data rather than vendor SLA numbers.

The state machine framing forces one design decision that has outsized value: every request must have a terminal state, and every terminal state must have a reason. Systems that fail silently or loop indefinitely cannot exist in a model where every request must exit through a defined terminal outcome. That constraint alone eliminates entire categories of production bugs — the kind where requests never quite fail but also never quite succeed, consuming resources while producing nothing.

The Boring Reality of Resilience

The actual implementation — a state enum, a transition table, a structured event emitter, and a trace aggregator — is not complex. The patterns are established. Exponential backoff with jitter, circuit breakers with half-open probe states, hard timeouts, fallback chains, re-prompt on validation failure: none of these are novel ideas.

What's consistently missing isn't the mechanism but the model. Engineers implement retry logic and call it done. They don't ask "what state is this request in?" because they didn't define the states. The result is resilience that works under happy-path conditions but becomes undebuggable at the specific moment it matters most — when something is going wrong in production and you need to understand why.

Treat the request lifecycle as a state machine. Every state has a name. Every transition has a reason. Every request has a terminal outcome. That discipline is the difference between resilience you can reason about and resilience that only works until you need to explain it.
