The LLM Request Lifecycle Is a State Machine — Treat It Like One

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is after-the-fact debugging: engineers piece a request's history together from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine makes that question answerable without archaeological work.

Why Implicit States Are a Production Liability

Consider a typical LLM call wrapper in production. It probably has:

  • A retry loop with exponential backoff for transient errors
  • A fallback to a secondary model when the primary returns a 503 or hits a rate limit
  • A validation step that checks whether the output matches the expected schema
  • A re-prompt path that rewrites the request when validation fails

Each of these is a state. Each transition between them is an event. But most implementations encode all of this as nested if-else logic inside a single function, with ad hoc logging that captures some transitions and misses others.

The practical consequences show up in three recurring failure modes:

Silent fallbacks. The primary model fails, the system routes to the secondary, and the user gets a response — but the response has subtly worse quality, higher latency, or higher cost than usual. Because no one logged the fallback transition explicitly, this pattern is invisible in dashboards until someone notices something wrong downstream.

Retry storms in disguise. When a provider experiences degraded performance rather than hard errors, retries succeed on the third or fourth attempt and appear as "slow requests" rather than "failures." The system looks fine from the outside. But aggregate retry costs accumulate, and the 90th percentile latency increases silently over weeks.

Validation failures misclassified as successes. The API returns HTTP 200 with a response that fails schema validation. The wrapper re-prompts once, gets a valid response, and returns it. The caller sees a success. The fact that this request required two LLM calls instead of one, took roughly twice as long, and cost roughly twice as much goes unrecorded.

All three of these failures have the same root cause: the intermediate states are invisible.

Mapping the Actual States

A useful model defines eight states for any LLM request:

  1. PENDING — the request is queued but hasn't been dispatched
  2. DISPATCHED — the request has been sent to a specific provider/model
  3. AWAITING_RESPONSE — waiting for the stream to begin or the response to arrive
  4. VALIDATING — the response arrived; output is being checked
  5. RETRYING — the previous attempt failed; backoff timer is running
  6. FALLING_BACK — the primary path is considered unavailable; routing to secondary
  7. CIRCUIT_OPEN — the circuit breaker has tripped; requests fail fast without attempting the provider
  8. TERMINAL — the request reached a final outcome (SUCCESS, VALIDATION_FAILURE, EXHAUSTED, or DEGRADED)

The transitions between these states are events, not log messages. Each transition has a cause (what triggered it), a target state (where we're going), and a cost (latency added, tokens consumed, provider change).
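As a minimal sketch of the model above, the eight states can be written as an enum and each transition as a small record carrying its cause, target, and cost. The `ALLOWED` table is an assumption made for illustration: the list above names the states but does not spell out which edges are legal, so treat the edges here as one plausible reading rather than a specification. The names `Transition` and `check` are likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    PENDING = auto()
    DISPATCHED = auto()
    AWAITING_RESPONSE = auto()
    VALIDATING = auto()
    RETRYING = auto()
    FALLING_BACK = auto()
    CIRCUIT_OPEN = auto()
    TERMINAL = auto()


# Assumed edges (not given in the text): adjust to match your own wrapper.
ALLOWED = {
    State.PENDING: {State.DISPATCHED, State.CIRCUIT_OPEN},
    State.DISPATCHED: {State.AWAITING_RESPONSE},
    State.AWAITING_RESPONSE: {State.VALIDATING, State.RETRYING, State.FALLING_BACK},
    State.VALIDATING: {State.TERMINAL, State.RETRYING},
    State.RETRYING: {State.DISPATCHED, State.FALLING_BACK, State.TERMINAL},
    State.FALLING_BACK: {State.DISPATCHED, State.TERMINAL},
    State.CIRCUIT_OPEN: {State.FALLING_BACK, State.TERMINAL},
    State.TERMINAL: set(),
}


@dataclass(frozen=True)
class Transition:
    source: State
    target: State              # where the request is going
    cause: str                 # what triggered the transition
    latency_ms: float = 0.0    # cost: latency added
    tokens: int = 0            # cost: tokens consumed
    provider_changed: bool = False  # cost: provider change


def check(t: Transition) -> Transition:
    """Reject transitions that the table above does not allow."""
    if t.target not in ALLOWED[t.source]:
        raise ValueError(f"illegal transition {t.source.name} -> {t.target.name}")
    return t
```

With a table like this, a transition the design never intended (say, PENDING straight to VALIDATING) fails loudly instead of slipping through a nested conditional.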

What makes this machine explicit rather than implicit is recording every state entry and exit as a structured event, not as a side effect of log.info scattered through the retry loop.
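One way to record entries and exits is a small per-request trace object that appends a structured event every time the state changes. This is a sketch under stated assumptions: `RequestTrace`, its field names, and the event shape (`event`, `state`, `cause`, `cost`, `dwell_ms`) are hypothetical, and a real system would ship these events to whatever telemetry pipeline it already uses instead of holding them in memory.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RequestTrace:
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    state: Optional[str] = None
    events: list = field(default_factory=list)
    _entered_at: float = 0.0

    def enter(self, new_state: str, cause: str, **cost) -> None:
        """Record the exit from the current state and the entry into the next."""
        now = self.clock()
        if self.state is not None:
            self.events.append({
                "request_id": self.request_id,
                "event": "exit",
                "state": self.state,
                "dwell_ms": round((now - self._entered_at) * 1000, 3),
            })
        self.events.append({
            "request_id": self.request_id,
            "event": "enter",
            "state": new_state,
            "cause": cause,
            "cost": cost,  # e.g. tokens=..., provider=...
        })
        self.state = new_state
        self._entered_at = now

    def to_jsonl(self) -> str:
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


trace = RequestTrace()
trace.enter("PENDING", cause="user_request")
trace.enter("DISPATCHED", cause="dequeued", provider="primary")
trace.enter("RETRYING", cause="http_503", attempt=1)
print(trace.to_jsonl())
```

Because every entry is paired with an exit that carries the time spent in the state, a question such as "how long did this request sit in RETRYING?" becomes a lookup in the trace instead of a reconstruction from scattered log lines.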

Separating Three Distinct Failure Categories

One reason implicit implementations get complicated is that engineers conflate three failure types that actually require different handling:

Transient infrastructure failures are short-lived and self-resolving: rate limit 429s, brief 503s, network timeouts, TLS handshakes that time out under load. The correct response is to wait and retry the same provider. The wait time should start at one to two seconds, double with each attempt, and include random jitter to avoid thundering herds. A reasonable cap is three to five retries before escalating.
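The retry policy described above can be sketched as follows. The defaults (a one-second base, doubling, four retries) sit inside the ranges the paragraph gives. The jitter formula, which adds up to 25% of the nominal delay, is an assumption, since the text only says the jitter should be random. `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names, not part of any provider SDK.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for 429s, brief 503s, and network timeouts."""


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    max_retries: int = 4,
    jitter_ratio: float = 0.25,  # assumed: up to +25% random jitter
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one delay per retry: base, base*2, base*4, ... plus jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        nominal = base * factor ** attempt
        yield nominal + rng.uniform(0, jitter_ratio * nominal)


def call_with_retries(
    call: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Retry the same provider on transient errors, then re-raise to escalate."""
    delays = backoff_delays(max_retries=max_retries, rng=rng)
    while True:
        try:
            return call()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; the caller decides how to escalate
            sleep(delay)
```

Passing `sleep` and `rng` as parameters keeps the policy testable without real waiting, and it leaves a natural place to emit a RETRYING transition before each sleep.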
